2025-05-07T20:23:19.6967074Z Current runner version: '2.323.0'
2025-05-07T20:23:19.6973897Z Runner name: 'i-04dd41b83603cbddd'
2025-05-07T20:23:19.6974849Z Machine name: 'ip-10-0-8-106'
2025-05-07T20:23:19.6977590Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:19.6979882Z Contents: read
2025-05-07T20:23:19.6980415Z Metadata: read
2025-05-07T20:23:19.6980915Z Packages: read
2025-05-07T20:23:19.6981414Z ##[endgroup]
2025-05-07T20:23:19.6983299Z Secret source: None
2025-05-07T20:23:19.6983942Z Prepare workflow directory
2025-05-07T20:23:19.7898929Z Prepare all required actions
2025-05-07T20:23:19.7938127Z Getting action download info
2025-05-07T20:23:20.0004139Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:20.3009944Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:20.7309895Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:22.4442951Z Getting action download info
2025-05-07T20:23:22.5667867Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:22.8530935Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.8.0, 12.6.3, gcc)
2025-05-07T20:23:22.9136851Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:22.9272823Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:22.9285758Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:22.9287306Z ##[endgroup]
2025-05-07T20:23:24.7938367Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:24.7938813Z Instance Type: g5.4xlarge
2025-05-07T20:23:24.7939066Z AMI Name: unknown
2025-05-07T20:23:24.7974161Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:30.1909220Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:30.1909540Z with:
2025-05-07T20:23:30.1909766Z   submodules: true
2025-05-07T20:23:30.1910012Z   repository: pytorch/FBGEMM
2025-05-07T20:23:30.1910408Z   token: ***
2025-05-07T20:23:30.1910617Z   ssh-strict: true
2025-05-07T20:23:30.1910836Z   ssh-user: git
2025-05-07T20:23:30.1911062Z   persist-credentials: true
2025-05-07T20:23:30.1911320Z   clean: true
2025-05-07T20:23:30.1911552Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:30.1911828Z   fetch-depth: 1
2025-05-07T20:23:30.1912051Z   fetch-tags: false
2025-05-07T20:23:30.1912273Z   show-progress: true
2025-05-07T20:23:30.1912501Z   lfs: false
2025-05-07T20:23:30.1912713Z   set-safe-directory: true
2025-05-07T20:23:30.1912978Z env:
2025-05-07T20:23:30.1913196Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:30.1913843Z   BUILD_ENV: build_binary
2025-05-07T20:23:30.1914113Z   BUILD_TARGET: genai
2025-05-07T20:23:30.1914350Z   BUILD_VARIANT: cuda
2025-05-07T20:23:30.1914618Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:30.1914877Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:30.1915119Z ##[endgroup]
2025-05-07T20:23:30.3088420Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:30.3090630Z ##[group]Getting Git version info
2025-05-07T20:23:30.3091663Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:30.3092840Z [command]/usr/bin/git version
2025-05-07T20:23:30.3093401Z git version 2.47.1
2025-05-07T20:23:30.3109848Z ##[endgroup]
2025-05-07T20:23:30.3123090Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/98d045e0-d391-4420-aa0d-7228e750a89f/.gitconfig'
2025-05-07T20:23:30.3145694Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/98d045e0-d391-4420-aa0d-7228e750a89f' before making global git config changes
2025-05-07T20:23:30.3147433Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:30.3151653Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:30.3196826Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:30.3221419Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:30.3239323Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:30.3243108Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:30.3269162Z refs/heads/main
2025-05-07T20:23:30.3279076Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:31.1952424Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.1999863Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:31.2027445Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:31.2033757Z ##[endgroup]
2025-05-07T20:23:31.2036800Z [command]/usr/bin/git submodule status
2025-05-07T20:23:31.2453382Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:31.2536868Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:31.2622549Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:31.2711969Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:31.2800068Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:31.2885847Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:31.2967955Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:31.2982066Z ##[group]Cleaning the repository
2025-05-07T20:23:31.2987204Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:31.3045917Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:31.3157410Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.3164720Z ##[endgroup]
2025-05-07T20:23:31.3166728Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:31.3171498Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:31.3203132Z ##[endgroup]
2025-05-07T20:23:31.3203539Z ##[group]Setting up auth
2025-05-07T20:23:31.3208731Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:31.3251078Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:31.3580245Z Entering 'external/asmjit'
2025-05-07T20:23:31.3646920Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.3721280Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.3787622Z Entering 'external/cutlass'
2025-05-07T20:23:31.3862140Z Entering 'external/googletest'
2025-05-07T20:23:31.3926470Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.3992884Z Entering 'external/json'
2025-05-07T20:23:31.4078076Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:31.4110256Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:31.4439254Z Entering 'external/asmjit'
2025-05-07T20:23:31.4503494Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.4577075Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.4644947Z Entering 'external/cutlass'
2025-05-07T20:23:31.4721354Z Entering 'external/googletest'
2025-05-07T20:23:31.4787401Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.4853427Z Entering 'external/json'
2025-05-07T20:23:31.4940056Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:31.4992159Z ##[endgroup]
2025-05-07T20:23:31.4992576Z ##[group]Fetching the repository
2025-05-07T20:23:31.4999800Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:31.6983257Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:31.6984221Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:31.7009743Z ##[endgroup]
2025-05-07T20:23:31.7010336Z ##[group]Determining the checkout info
2025-05-07T20:23:31.7012173Z ##[endgroup]
2025-05-07T20:23:31.7016774Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:31.7069262Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:31.7098449Z ##[group]Checking out the ref
2025-05-07T20:23:31.7102718Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:31.7231912Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:31.7235551Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:31.7245123Z ##[endgroup]
2025-05-07T20:23:31.7245697Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:31.7250838Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:31.7300750Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:31.7332815Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:31.7364319Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:31.7393822Z ##[endgroup]
2025-05-07T20:23:31.7394376Z ##[group]Fetching submodules
2025-05-07T20:23:31.7396669Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:31.7770479Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:31.7771119Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:31.7771945Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:31.7772362Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:31.7772774Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:31.7773195Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:31.7773601Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:31.7786788Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:31.8210729Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:31.8355629Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:31.8454122Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:31.8619275Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:31.8706406Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:31.8787682Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:31.8885217Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:31.8902522Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:31.9235958Z Entering 'external/asmjit'
2025-05-07T20:23:31.9268855Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.9301308Z Entering 'external/cpuinfo'
2025-05-07T20:23:31.9333995Z Entering 'external/cutlass'
2025-05-07T20:23:31.9365982Z Entering 'external/googletest'
2025-05-07T20:23:31.9398074Z Entering 'external/hipify_torch'
2025-05-07T20:23:31.9430359Z Entering 'external/json'
2025-05-07T20:23:31.9475600Z ##[endgroup]
2025-05-07T20:23:31.9476038Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:31.9481638Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:31.9812050Z Entering 'external/asmjit'
2025-05-07T20:23:31.9854367Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9854844Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9896912Z Entering 'external/composable_kernel'
2025-05-07T20:23:31.9939703Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9940053Z url.https://github.com/.insteadof
2025-05-07T20:23:31.9990226Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.0036860Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0037190Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0080564Z Entering 'external/cutlass'
2025-05-07T20:23:32.0123462Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0123796Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0174436Z Entering 'external/googletest'
2025-05-07T20:23:32.0217368Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0217732Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0260174Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.0302511Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0302846Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0343681Z Entering 'external/json'
2025-05-07T20:23:32.0385301Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0385642Z url.https://github.com/.insteadof
2025-05-07T20:23:32.0445860Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:32.0775243Z Entering 'external/asmjit'
2025-05-07T20:23:32.0837103Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:32.0840461Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.0900638Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:32.0903513Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.0964652Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:32.0968007Z Entering 'external/cutlass'
2025-05-07T20:23:32.1029263Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:32.1032328Z Entering 'external/googletest'
2025-05-07T20:23:32.1093284Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:32.1096578Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.1156495Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:32.1159678Z Entering 'external/json'
2025-05-07T20:23:32.1219398Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:32.1340257Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:32.1670882Z Entering 'external/asmjit'
2025-05-07T20:23:32.1704523Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.1736942Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.1768642Z Entering 'external/cutlass'
2025-05-07T20:23:32.1800544Z Entering 'external/googletest'
2025-05-07T20:23:32.1832256Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.1863951Z Entering 'external/json'
2025-05-07T20:23:32.1917819Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:32.2251700Z Entering 'external/asmjit'
2025-05-07T20:23:32.2284590Z Entering 'external/composable_kernel'
2025-05-07T20:23:32.2317498Z Entering 'external/cpuinfo'
2025-05-07T20:23:32.2349472Z Entering 'external/cutlass'
2025-05-07T20:23:32.2381753Z Entering 'external/googletest'
2025-05-07T20:23:32.2412765Z Entering 'external/hipify_torch'
2025-05-07T20:23:32.2444665Z Entering 'external/json'
2025-05-07T20:23:32.2487683Z ##[endgroup]
2025-05-07T20:23:32.2528765Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:32.2555319Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
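For reference, the checkout step above boils down to three git settings plus a shallow fetch: an HTTP extraheader carrying the (masked) token, insteadOf rewrites that route SSH-style submodule URLs over HTTPS, and a depth-1 fetch of the PR merge ref. A minimal standalone sketch of the same sequence, assuming B64_TOKEN is a placeholder for base64("x-access-token:<GITHUB_TOKEN>"):

    # Placeholder token; actions/checkout derives the real value from GITHUB_TOKEN
    git config --local http.https://github.com/.extraheader "AUTHORIZATION: basic ${B64_TOKEN}"
    git config --global --add url.https://github.com/.insteadOf git@github.com:
    # Shallow-fetch and check out the PR merge ref, exactly as logged for PR #4066
    git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 \
        origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
    git checkout --progress --force refs/remotes/pull/4066/merge
    git -c protocol.version=2 submodule update --init --force --depth=1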
url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-9c0298fb-5696-52b2-b592-faf612d983f7/artifacts/e48f0a27a297b17d0606bf1cfc4cb07571f0d4bdb9bce51dcfb63b95a2571c5a.zip 2025-05-07T20:23:32.6613147Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:32.7545488Z (node:197886) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead. 2025-05-07T20:23:32.7547101Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-05-07T20:23:33.0333591Z SHA256 digest of downloaded artifact is 0316113a2b3fde93fffa97b955c92dd5eef475455a84550f9225df12df45620e 2025-05-07T20:23:33.0334236Z Artifact download completed successfully. 2025-05-07T20:23:33.0334581Z Total of 1 artifact(s) downloaded 2025-05-07T20:23:33.0340610Z Download artifact has finished successfully 2025-05-07T20:23:33.0594604Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2025-05-07T20:23:33.0595030Z with: 2025-05-07T20:23:33.0595256Z driver-version: 570.133.07 2025-05-07T20:23:33.0595517Z env: 2025-05-07T20:23:33.0595751Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:33.0596072Z BUILD_ENV: build_binary 2025-05-07T20:23:33.0596331Z BUILD_TARGET: genai 2025-05-07T20:23:33.0596569Z BUILD_VARIANT: cuda 2025-05-07T20:23:33.0596818Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:33.0597091Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:33.0597335Z ##[endgroup] 2025-05-07T20:23:33.0692658Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2025-05-07T20:23:33.0693066Z with: 2025-05-07T20:23:33.0693297Z timeout_minutes: 10 2025-05-07T20:23:33.0693538Z max_attempts: 3 2025-05-07T20:23:33.0717504Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. 
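The download step accepts the artifact only because the computed SHA256 digest matches the expected one printed above. A hypothetical manual equivalent of that integrity check (the archive filename is an assumption; the action hashes the downloaded artifact archive before extraction):

    EXPECTED=0316113a2b3fde93fffa97b955c92dd5eef475455a84550f9225df12df45620e
    # Hypothetical local name for the downloaded artifact archive
    ACTUAL=$(sha256sum artifact.zip | cut -d' ' -f1)
    if [ "${ACTUAL}" != "${EXPECTED}" ]; then
        echo "SHA256 mismatch: got ${ACTUAL}, expected ${EXPECTED}" >&2
        exit 1
    fi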
2025-05-07T20:23:33.0594604Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:33.0595030Z with:
2025-05-07T20:23:33.0595256Z   driver-version: 570.133.07
2025-05-07T20:23:33.0595517Z env:
2025-05-07T20:23:33.0595751Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.0596072Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.0596331Z   BUILD_TARGET: genai
2025-05-07T20:23:33.0596569Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.0596818Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:33.0597091Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.0597335Z ##[endgroup]
2025-05-07T20:23:33.0692658Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:33.0693066Z with:
2025-05-07T20:23:33.0693297Z   timeout_minutes: 10
2025-05-07T20:23:33.0693538Z   max_attempts: 3
2025-05-07T20:23:33.0717504Z   command: # Is it disgusting to have a full shell script here in this github action? Sure
# But is it the best way to make it so that this action relies on nothing else? Absolutely
set -eou pipefail

DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

install_nvidia_docker2_amzn2() {
    (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
            YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
            # Amazon Linux 2
            YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
    )
}

install_nvidia_docker2_ubuntu20() {
    (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
            sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
            sudo systemctl restart docker
        fi
    )
}

pre_install_nvidia_driver_amzn2() {
    (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
    )
}

install_nvidia_driver_common() {
    (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
            set +e
            # The driver exists, check its version next. Also check only the first GPU
            # if there are more than one of them so that the same driver version is
            # not printed over multiple lines
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
                echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
            elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
                echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
                # Turn off persistent mode so that the installation script can unload the kernel module
                sudo killall nvidia-persistenced || true
            else
                HAS_NVIDIA_DRIVER=1
                echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
            fi
            set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
            # CAUTION: this may need to be updated in future
            if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
                sudo yum groupinstall -y "Development Tools"
                # ensure our kernel install is the same as our underlying kernel,
                # groupinstall "Development Tools" has a habit of mismatching kernel headers
                sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
                sudo modprobe backlight
            fi
            sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

            set +e
            sudo /bin/bash /tmp/nvidia_driver -s --no-drm
            NVIDIA_INSTALLATION_STATUS=$?

            RESET_GPU=0
            if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
                sudo cat /var/log/nvidia-installer.log
                # Failed to install NVIDIA driver, try to reset the GPU
                RESET_GPU=1
            elif [ -x "$(command -v nvidia-smi)" ]; then
                # Check again if nvidia-smi works even if the driver installation completes successfully
                INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
                NVIDIA_SMI_STATUS=$?
                if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
                    RESET_GPU=1
                fi
            fi

            if [ "$RESET_GPU" -eq 1 ]; then
                NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
                # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
                # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
                for PCI_ID in $NVIDIA_DEVICES; do
                    DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
                    echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
                    # This requires sudo permission of course
                    echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
                    sleep 1
                done
            fi

            sudo rm -fv /tmp/nvidia_driver
            set -e
        fi
    )
}

post_install_nvidia_driver_common() {
    (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        (
            set +e
            nvidia-smi
            # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
            # the case where the driver has already crashed as it still can get the driver version
            # and some basic information like the bus ID. However, the rest of the information
            # would be missing (ERR!), for example:
            #
            # +-----------------------------------------------------------------------------+
            # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
            # |-------------------------------+----------------------+----------------------+
            # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
            # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
            # |                               |                      |               MIG M. |
            # |===============================+======================+======================|
            # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
            # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
            # |                               |                      |                 ERR! |
            # +-------------------------------+----------------------+----------------------+
            #
            # +-----------------------------------------------------------------------------+
            # | Processes:                                                                  |
            # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
            # |        ID   ID                                                   Usage      |
            # |=============================================================================|
            # +-----------------------------------------------------------------------------+
            #
            # This should be reported as a failure instead as it is guaranteed to fail when
            # Docker tries to run with --gpus all
            #
            # So, the correct check here is to query one of the missing pieces of info like
            # GPU name, so that the command can fail accordingly
            nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
            NVIDIA_SMI_STATUS=$?

            # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
            if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
                echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
            else
                echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
                exit ${NVIDIA_SMI_STATUS}
            fi
            set -e
        )
    )
}

install_nvidia_driver_amzn2() {
    (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
    )
}

install_nvidia_driver_ubuntu20() {
    (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
    )
}

echo "== Installing nvidia driver ${DRIVER_FN} =="
case "${DISTRIBUTION}" in
    amzn*)
        install_nvidia_driver_amzn2
        ;;
    ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
    *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
esac

# Install container toolkit based on distribution
echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
case "${DISTRIBUTION}" in
    amzn*)
        install_nvidia_docker2_amzn2
        ;;
    ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
    *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
esac

echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

# Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
# more than one GPU. This just needs to be run once. The command fails
# on subsequent runs and complains that the mode is already on, but that's
# ok
sudo nvidia-persistenced || true

# This should show persistence mode ON
nvidia-smi
2025-05-07T20:23:33.0741654Z   retry_wait_seconds: 10
2025-05-07T20:23:33.0741924Z   polling_interval_seconds: 1
2025-05-07T20:23:33.0742198Z   warning_on_retry: true
2025-05-07T20:23:33.0742457Z   continue_on_error: false
2025-05-07T20:23:33.0742713Z env:
2025-05-07T20:23:33.0742937Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:33.0743255Z   BUILD_ENV: build_binary
2025-05-07T20:23:33.0762258Z   BUILD_TARGET: genai
2025-05-07T20:23:33.0762534Z   BUILD_VARIANT: cuda
2025-05-07T20:23:33.0762780Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:33.0763050Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:33.0763296Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:33.0763539Z ##[endgroup]
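Before the execution trace below, note the install-skip logic at the heart of install_nvidia_driver_common: it queries the driver version of GPU 0 only and installs nothing when it matches the requested version. A condensed sketch of that path, using the same commands as the script above (error handling elided; DRIVER_VERSION comes from the step's env):

    DRIVER_VERSION=570.133.07
    set +e
    INSTALLED=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
    set -e
    if [ "${INSTALLED}" = "${DRIVER_VERSION}" ]; then
        # Matches the requested version, so the full driver install is skipped
        echo "NVIDIA driver (${INSTALLED}) has already been installed. Skipping"
    fi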
2025-05-07T20:23:33.9599979Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:33.9600682Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:33.9603258Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:34.2576835Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:34.2577449Z No packages marked for removal.
2025-05-07T20:23:34.2647510Z Dependencies resolved.
2025-05-07T20:23:34.2658010Z Nothing to do.
2025-05-07T20:23:34.2658385Z Complete!
2025-05-07T20:23:34.3010205Z + install_nvidia_driver_common
2025-05-07T20:23:34.3014297Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:34.3014610Z + lspci
2025-05-07T20:23:34.3016308Z Before installing NVIDIA driver
2025-05-07T20:23:34.3134711Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:34.3135765Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:34.3136347Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:34.3136879Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:34.3137368Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:34.3137903Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:34.3138389Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:34.3138915Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:34.3139346Z + lsmod
2025-05-07T20:23:34.3183027Z Module                  Size  Used by
2025-05-07T20:23:34.3183634Z xt_nat                 16384  0
2025-05-07T20:23:34.3184157Z nvidia_modeset       1716224  0
2025-05-07T20:23:34.3184726Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:34.3185337Z wmi                    36864  1 video
2025-05-07T20:23:34.3185886Z nvidia_uvm           1884160  0
2025-05-07T20:23:34.3186499Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:34.3187153Z drm                   602112  1 nvidia
2025-05-07T20:23:34.3187763Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:34.3188499Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:34.3188982Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:34.3189295Z veth                   36864  0
2025-05-07T20:23:34.3189562Z xt_conntrack           16384  1
2025-05-07T20:23:34.3189827Z nft_chain_nat          16384  3
2025-05-07T20:23:34.3190089Z xt_MASQUERADE          20480  1
2025-05-07T20:23:34.3190794Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:34.3191151Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:34.3191592Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:34.3192059Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:34.3192382Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:34.3192684Z xfrm_user              57344  1
2025-05-07T20:23:34.3192953Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:34.3193254Z xt_addrtype            16384  2
2025-05-07T20:23:34.3193524Z nft_compat             20480  4
2025-05-07T20:23:34.3193835Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:34.3194260Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:34.3194641Z br_netfilter           36864  0
2025-05-07T20:23:34.3194922Z bridge                323584  1 br_netfilter
2025-05-07T20:23:34.3195234Z stp                    16384  1 bridge
2025-05-07T20:23:34.3195526Z llc                    16384  2 bridge,stp
2025-05-07T20:23:34.3195818Z overlay               167936  0
2025-05-07T20:23:34.3196070Z tls                   135168  0
2025-05-07T20:23:34.3196333Z nls_ascii              16384  1
2025-05-07T20:23:34.3196592Z nls_cp437              20480  1
2025-05-07T20:23:34.3196841Z vfat                   24576  1
2025-05-07T20:23:34.3197099Z fat                    86016  1 vfat
2025-05-07T20:23:34.3197370Z ena                   180224  0
2025-05-07T20:23:34.3197615Z i8042                  45056  0
2025-05-07T20:23:34.3197874Z serio                  28672  3 i8042
2025-05-07T20:23:34.3198161Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:34.3198429Z button                 24576  0
2025-05-07T20:23:34.3198685Z sunrpc                696320  1
2025-05-07T20:23:34.3198948Z sch_fq_codel           20480  17
2025-05-07T20:23:34.3199206Z dm_mod                188416  0
2025-05-07T20:23:34.3199473Z dax                    45056  1 dm_mod
2025-05-07T20:23:34.3199760Z fuse                  163840  1
2025-05-07T20:23:34.3200014Z loop                   36864  0
2025-05-07T20:23:34.3200459Z configfs               57344  1
2025-05-07T20:23:34.3200746Z dmi_sysfs              20480  0
2025-05-07T20:23:34.3201114Z crc32_pclmul           16384  0
2025-05-07T20:23:34.3201378Z crc32c_intel           24576  0
2025-05-07T20:23:34.3201633Z efivarfs               24576  1
2025-05-07T20:23:34.3201906Z + modinfo nvidia
2025-05-07T20:23:34.3202304Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:34.3202758Z import_ns: DMA_BUF
2025-05-07T20:23:34.3203012Z alias: char-major-195-*
2025-05-07T20:23:34.3203292Z version: 570.133.07
2025-05-07T20:23:34.3203552Z supported: external
2025-05-07T20:23:34.3203803Z license: Dual MIT/GPL
2025-05-07T20:23:34.3204100Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:34.3204458Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:34.3204784Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:34.3205123Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:34.3205474Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:34.3205818Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:34.3206143Z depends: i2c-core,drm
2025-05-07T20:23:34.3206412Z retpoline: Y
2025-05-07T20:23:34.3206631Z name: nvidia
2025-05-07T20:23:34.3207004Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:34.3207491Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:34.3207955Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:34.3208379Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:34.3208701Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:34.3209059Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:34.3209481Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:34.3209800Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:34.3210113Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:34.3210485Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:34.3210883Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:34.3211226Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:34.3211537Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:34.3211849Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:34.3212221Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:34.3212631Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:34.3213016Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:34.3213617Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.3214042Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:34.3214481Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.3214906Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:34.3215258Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:34.3215653Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:34.3216037Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:34.3216395Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:34.3216733Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:34.3217071Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:34.3217410Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:34.3217736Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:34.3218090Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:34.3218470Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:34.3218814Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:34.3219163Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:34.3219521Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:34.3219875Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:34.3220231Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:34.3220772Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:34.3221080Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:34.3221421Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:34.3221754Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:34.3222085Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:34.3222431Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:34.3222795Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:34.3223160Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:34.3223499Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:34.3223875Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:34.3224236Z parm: rm_firmware_active:charp
2025-05-07T20:23:34.3224547Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:34.3224794Z ++ command -v nvidia-smi
2025-05-07T20:23:34.3225066Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:34.3225336Z + set +e
2025-05-07T20:23:34.3225662Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:34.3449818Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:34.3450137Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:34.3450386Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:34.3450609Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:34.3450888Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:34.3451329Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:34.3451808Z + set -e
2025-05-07T20:23:34.3452015Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:34.3452416Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:34.3453246Z + post_install_nvidia_driver_common
2025-05-07T20:23:34.3457360Z + sudo modprobe nvidia
2025-05-07T20:23:34.4614168Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:34.4614503Z + lspci
2025-05-07T20:23:34.4614732Z After installing NVIDIA driver
2025-05-07T20:23:34.4727143Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:34.4727754Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:34.4728321Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:34.4728858Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:34.4729350Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:34.4729891Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:34.4730394Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:34.4730883Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:34.4731309Z + lsmod
2025-05-07T20:23:34.4759620Z Module                  Size  Used by
2025-05-07T20:23:34.4759920Z xt_nat                 16384  0
2025-05-07T20:23:34.4760273Z nvidia_modeset       1716224  0
2025-05-07T20:23:34.4760575Z video                  65536  1 nvidia_modeset
2025-05-07T20:23:34.4760888Z wmi                    36864  1 video
2025-05-07T20:23:34.4761163Z nvidia_uvm           1884160  0
2025-05-07T20:23:34.4761477Z nvidia              11583488  7 nvidia_uvm,nvidia_modeset
2025-05-07T20:23:34.4761815Z drm                   602112  1 nvidia
2025-05-07T20:23:34.4762126Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:34.4762521Z backlight              24576  3 video,drm,nvidia_modeset
2025-05-07T20:23:34.4762883Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:34.4763182Z veth                   36864  0
2025-05-07T20:23:34.4763447Z xt_conntrack           16384  1
2025-05-07T20:23:34.4763716Z nft_chain_nat          16384  3
2025-05-07T20:23:34.4763992Z xt_MASQUERADE          20480  1
2025-05-07T20:23:34.4764309Z nf_nat                 57344  3 xt_nat,nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:34.4764669Z nf_conntrack_netlink    57344  0
2025-05-07T20:23:34.4765392Z nf_conntrack          184320  5 xt_conntrack,nf_nat,xt_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:34.4765876Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:34.4766197Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:34.4766506Z xfrm_user              57344  1
2025-05-07T20:23:34.4766792Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:34.4767090Z xt_addrtype            16384  2
2025-05-07T20:23:34.4767363Z nft_compat             20480  4
2025-05-07T20:23:34.4767682Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:34.4768111Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:34.4768505Z br_netfilter           36864  0
2025-05-07T20:23:34.4768799Z bridge                323584  1 br_netfilter
2025-05-07T20:23:34.4769110Z stp                    16384  1 bridge
2025-05-07T20:23:34.4769406Z llc                    16384  2 bridge,stp
2025-05-07T20:23:34.4769710Z overlay               167936  0
2025-05-07T20:23:34.4769985Z tls                   135168  0
2025-05-07T20:23:34.4770245Z nls_ascii              16384  1
2025-05-07T20:23:34.4770515Z nls_cp437              20480  1
2025-05-07T20:23:34.4770778Z vfat                   24576  1
2025-05-07T20:23:34.4771040Z fat                    86016  1 vfat
2025-05-07T20:23:34.4771320Z ena                   180224  0
2025-05-07T20:23:34.4771582Z i8042                  45056  0
2025-05-07T20:23:34.4771845Z serio                  28672  3 i8042
2025-05-07T20:23:34.4772140Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:34.4772415Z button                 24576  0
2025-05-07T20:23:34.4772671Z sunrpc                696320  1
2025-05-07T20:23:34.4772936Z sch_fq_codel           20480  17
2025-05-07T20:23:34.4773211Z dm_mod                188416  0
2025-05-07T20:23:34.4773628Z dax                    45056  1 dm_mod
2025-05-07T20:23:34.4773907Z fuse                  163840  1
2025-05-07T20:23:34.4774168Z loop                   36864  0
2025-05-07T20:23:34.4774432Z configfs               57344  1
2025-05-07T20:23:34.4774693Z dmi_sysfs              20480  0
2025-05-07T20:23:34.4774956Z crc32_pclmul           16384  0
2025-05-07T20:23:34.4775222Z crc32c_intel           24576  0
2025-05-07T20:23:34.4775481Z efivarfs               24576  1
2025-05-07T20:23:34.4775738Z + modinfo nvidia
2025-05-07T20:23:34.4778108Z filename: /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:34.4778594Z import_ns: DMA_BUF
2025-05-07T20:23:34.4778875Z alias: char-major-195-*
2025-05-07T20:23:34.4779177Z version: 570.133.07
2025-05-07T20:23:34.4779436Z supported: external
2025-05-07T20:23:34.4779692Z license: Dual MIT/GPL
2025-05-07T20:23:34.4779995Z firmware: nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:34.4780356Z firmware: nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:34.4780681Z srcversion: 49515739FD8F721A3F2F714
2025-05-07T20:23:34.4781013Z alias: pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:34.4781368Z alias: pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:34.4781718Z alias: pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:34.4782043Z depends: i2c-core,drm
2025-05-07T20:23:34.4782310Z retpoline: Y
2025-05-07T20:23:34.4782536Z name: nvidia
2025-05-07T20:23:34.4782905Z vermagic: 6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:34.4783399Z parm: NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:34.4783869Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:34.4784298Z parm: NVreg_ResmanDebugLevel:int
2025-05-07T20:23:34.4784624Z parm: NVreg_RmLogonRC:int
2025-05-07T20:23:34.4784946Z parm: NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:34.4785268Z parm: NVreg_DeviceFileUID:int
2025-05-07T20:23:34.4785589Z parm: NVreg_DeviceFileGID:int
2025-05-07T20:23:34.4785914Z parm: NVreg_DeviceFileMode:int
2025-05-07T20:23:34.4786398Z parm: NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:34.4786797Z parm: NVreg_UsePageAttributeTable:int
2025-05-07T20:23:34.4787146Z parm: NVreg_EnablePCIeGen3:int
2025-05-07T20:23:34.4787461Z parm: NVreg_EnableMSI:int
2025-05-07T20:23:34.4787775Z parm: NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:34.4788154Z parm: NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:34.4788565Z parm: NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:34.4788954Z parm: NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:34.4789381Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.4789802Z parm: NVreg_DynamicPowerManagement:int
2025-05-07T20:23:34.4790240Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:34.4790663Z parm: NVreg_EnableGpuFirmware:int
2025-05-07T20:23:34.4791016Z parm: NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:34.4791400Z parm: NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:34.4791785Z parm: NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:34.4792138Z parm: NVreg_MemoryPoolSize:int
2025-05-07T20:23:34.4792475Z parm: NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:34.4792813Z parm: NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:34.4793149Z parm: NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:34.4793476Z parm: NVreg_NvLinkDisable:int
2025-05-07T20:23:34.4793842Z parm: NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:34.4794210Z parm: NVreg_RegisterPCIDriver:int
2025-05-07T20:23:34.4794552Z parm: NVreg_EnableResizableBar:int
2025-05-07T20:23:34.4795000Z parm: NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:34.4795351Z parm: NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:34.4795705Z parm: NVreg_RegistryDwords:charp
2025-05-07T20:23:34.4796063Z parm: NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:34.4796401Z parm: NVreg_RmMsg:charp
2025-05-07T20:23:34.4796705Z parm: NVreg_GpuBlacklist:charp
2025-05-07T20:23:34.4797043Z parm: NVreg_TemporaryFilePath:charp
2025-05-07T20:23:34.4797377Z parm: NVreg_ExcludedGpus:charp
2025-05-07T20:23:34.4797706Z parm: NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:34.4798049Z parm: NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:34.4798420Z parm: NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:34.4798806Z parm: NVreg_ImexChannelCount:int
2025-05-07T20:23:34.4799171Z parm: NVreg_CreateImexChannel0:int
2025-05-07T20:23:34.4799531Z parm: NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:34.4799885Z parm: rm_firmware_active:charp
2025-05-07T20:23:34.4800288Z + set +e
2025-05-07T20:23:34.4800495Z + nvidia-smi
2025-05-07T20:23:34.4953620Z Wed May  7 20:23:34 2025
2025-05-07T20:23:34.4954007Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.4954519Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:34.4955023Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:34.4955534Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:34.4956084Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:34.4956526Z |                                         |                        |               MIG M. |
2025-05-07T20:23:34.4956879Z |=========================================+========================+======================|
2025-05-07T20:23:34.5125282Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:34.5125933Z |  0%   29C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:34.5126338Z |                                         |                        |                  N/A |
2025-05-07T20:23:34.5126751Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:34.5128863Z
2025-05-07T20:23:34.5129283Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.5129724Z | Processes:                                                                              |
2025-05-07T20:23:34.5130185Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:34.5130620Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:34.5130980Z |=========================================================================================|
2025-05-07T20:23:34.5132781Z |  No running processes found                                                             |
2025-05-07T20:23:34.5133274Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:34.7488861Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:34.7652307Z NVIDIA A10G
2025-05-07T20:23:34.7698437Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:34.7698692Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:34.7698933Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:34.7699229Z + set -e
2025-05-07T20:23:34.7699449Z INFO: Ignoring allowed status 0
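The check that just passed is the post-install health probe from the script above: nvidia-smi can exit 0 while printing ERR! for most fields after a driver crash, so the probe queries a field that disappears in that state (the GPU name) and whitelists exit statuses 0 and 14. A condensed sketch:

    set +e
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    STATUS=$?
    set -e
    # 0 and 14 are allowable statuses, per https://github.com/NVIDIA/gpu-operator/issues/285
    if [ "${STATUS}" -ne 0 ] && [ "${STATUS}" -ne 14 ]; then
        echo "ERROR: nvidia-smi exited with unresolved status ${STATUS}"
        exit "${STATUS}"
    fi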
2025-05-07T20:23:34.7705811Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:34.7709382Z + sudo yum install -y yum-utils
2025-05-07T20:23:35.1894454Z Last metadata expiration check: 0:54:02 ago on Wed May  7 19:29:33 2025.
2025-05-07T20:23:35.2139993Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:35.2538847Z Dependencies resolved.
2025-05-07T20:23:35.2720924Z Nothing to do.
2025-05-07T20:23:35.2721160Z Complete!
2025-05-07T20:23:35.3116509Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:35.3117088Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.3117960Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.5921228Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:35.6491867Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:36.1888307Z nvidia-container-toolkit                         12 kB/s | 833  B     00:00
2025-05-07T20:23:36.2136622Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:36.2535968Z Dependencies resolved.
2025-05-07T20:23:36.2713876Z ================================================================================
2025-05-07T20:23:36.2714787Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:36.2715568Z ================================================================================
2025-05-07T20:23:36.2716182Z Downgrading:
2025-05-07T20:23:36.2716931Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-05-07T20:23:36.2718132Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-05-07T20:23:36.2718849Z
2025-05-07T20:23:36.2719047Z Transaction Summary
2025-05-07T20:23:36.2719425Z ================================================================================
2025-05-07T20:23:36.2719837Z Downgrade  2 Packages
2025-05-07T20:23:36.2720063Z
2025-05-07T20:23:36.2720282Z Total download size: 6.8 M
2025-05-07T20:23:36.2720662Z Downloading Packages:
2025-05-07T20:23:36.3115276Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  32 MB/s | 1.2 MB     00:00
2025-05-07T20:23:36.3800584Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  52 MB/s | 5.6 MB     00:00
2025-05-07T20:23:36.3812593Z --------------------------------------------------------------------------------
2025-05-07T20:23:36.3815965Z Total                                            62 MB/s | 6.8 MB     00:00
2025-05-07T20:23:36.3818391Z Running transaction check
2025-05-07T20:23:36.3919673Z Transaction check succeeded.
2025-05-07T20:23:36.3919966Z Running transaction test
2025-05-07T20:23:36.4212753Z Transaction test succeeded.
2025-05-07T20:23:36.4215650Z Running transaction
2025-05-07T20:23:36.9735612Z   Preparing        :                                                      1/1
2025-05-07T20:23:37.0789397Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64        1/4
2025-05-07T20:23:37.0808207Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:37.1020072Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             2/4
2025-05-07T20:23:37.1020657Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64             3/4
2025-05-07T20:23:37.1124422Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64             3/4
2025-05-07T20:23:37.1147993Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64        4/4
2025-05-07T20:23:37.2893902Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64             4/4
2025-05-07T20:23:37.2894495Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64             1/4
2025-05-07T20:23:37.2895046Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64             2/4
2025-05-07T20:23:37.2895592Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64        3/4
2025-05-07T20:23:37.4340179Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64        4/4
================================================================================
2025-05-07T20:23:37.4340837Z WARNING:
2025-05-07T20:23:37.4341181Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:37.4341441Z
2025-05-07T20:23:37.4341538Z   Available Versions:
2025-05-07T20:23:37.4341706Z
2025-05-07T20:23:37.4341799Z   Version 2023.7.20250331:
2025-05-07T20:23:37.4342134Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:37.4342399Z
2025-05-07T20:23:37.4342533Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:37.4342755Z
2025-05-07T20:23:37.4342845Z     Release notes:
2025-05-07T20:23:37.4343278Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:37.4343666Z
2025-05-07T20:23:37.4343766Z   Version 2023.7.20250414:
2025-05-07T20:23:37.4344087Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:37.4344369Z
2025-05-07T20:23:37.4344491Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:37.4344716Z
2025-05-07T20:23:37.4344805Z     Release notes:
2025-05-07T20:23:37.4345228Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:37.4345611Z
2025-05-07T20:23:37.4345705Z   Version 2023.7.20250428:
2025-05-07T20:23:37.4346029Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:37.4346296Z
2025-05-07T20:23:37.4346420Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:37.4346638Z
2025-05-07T20:23:37.4346732Z     Release notes:
2025-05-07T20:23:37.4347140Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:37.4347528Z
2025-05-07T20:23:37.4347647Z ================================================================================
2025-05-07T20:23:37.4697988Z
2025-05-07T20:23:37.4698140Z
2025-05-07T20:23:37.4698235Z Downgraded:
2025-05-07T20:23:37.4698630Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:37.4699220Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:37.4699570Z
2025-05-07T20:23:37.4699662Z Complete!
2025-05-07T20:23:37.5184624Z + sudo systemctl restart docker
2025-05-07T20:23:40.5879570Z nvidia-persistenced failed to initialize. Check syslog for more details.
2025-05-07T20:23:40.6075029Z Wed May  7 20:23:40 2025
2025-05-07T20:23:40.6075420Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:40.6075945Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:40.6076448Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:40.6076960Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:40.6077498Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:40.6077970Z |                                         |                        |               MIG M. |
2025-05-07T20:23:40.6078334Z |=========================================+========================+======================|
2025-05-07T20:23:40.6207853Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:40.6208314Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:40.6208709Z |                                         |                        |                  N/A |
2025-05-07T20:23:40.6209118Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:40.6212162Z
2025-05-07T20:23:40.6212578Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:40.6213749Z | Processes:                                                                              |
2025-05-07T20:23:40.6214208Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:40.6214639Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:40.6214991Z |=========================================================================================|
2025-05-07T20:23:40.6217596Z |  No running processes found                                                             |
2025-05-07T20:23:40.6218086Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:41.1270235Z Command completed after 1 attempt(s).
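The setup step also exported GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all into GITHUB_ENV, visible in the environment dump below. A hypothetical later step would splice it into its docker invocation along these lines (the image name is a placeholder):

    # GPU_FLAG is intentionally left unquoted so it word-splits into two docker options
    docker run ${GPU_FLAG} --rm some-cuda-image nvidia-smi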
2025-05-07T20:23:41.1357535Z ##[group]Run . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:41.1358012Z . $PRELUDE; print_system_info; print_ec2_info
2025-05-07T20:23:41.1372040Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:41.1372420Z env:
2025-05-07T20:23:41.1372659Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:41.1372969Z   BUILD_ENV: build_binary
2025-05-07T20:23:41.1373228Z   BUILD_TARGET: genai
2025-05-07T20:23:41.1373472Z   BUILD_VARIANT: cuda
2025-05-07T20:23:41.1373714Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:41.1373984Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:41.1374303Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.1374648Z ##[endgroup]
2025-05-07T20:23:41.4726716Z ################################################################################
2025-05-07T20:23:41.4727109Z # Print System Info
2025-05-07T20:23:41.4727339Z #
2025-05-07T20:23:41.4741488Z # [2025-05-07T20:23:41.473Z] + print_system_info
2025-05-07T20:23:41.4741867Z ################################################################################
2025-05-07T20:23:41.4742102Z
2025-05-07T20:23:41.4742217Z ################################################################################
2025-05-07T20:23:41.4742576Z [INFO] Printing environment variables ...
2025-05-07T20:23:41.4742874Z + printenv
2025-05-07T20:23:41.4743000Z
2025-05-07T20:23:41.4751695Z SHELL=/bin/bash
2025-05-07T20:23:41.4752051Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:41.4752463Z BUILD_VARIANT=cuda
2025-05-07T20:23:41.4753099Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4753772Z GITHUB_ACTION=__run
2025-05-07T20:23:41.4754070Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.4754421Z GITHUB_RUN_NUMBER=10601
2025-05-07T20:23:41.4754675Z RUNNER_NAME=i-04dd41b83603cbddd
2025-05-07T20:23:41.4754973Z GITHUB_REPOSITORY_OWNER_ID=21003710
2025-05-07T20:23:41.4755286Z PLATFORM_NAME_LC=linux-x86_64
2025-05-07T20:23:41.4755555Z MACHINE_NAME_LC=x86_64
2025-05-07T20:23:41.4755934Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh
2025-05-07T20:23:41.4756380Z GITHUB_TRIGGERING_ACTOR=q10
2025-05-07T20:23:41.4756660Z PRELUDE=.github/scripts/setup_env.bash
2025-05-07T20:23:41.4756963Z GITHUB_REF_TYPE=branch
2025-05-07T20:23:41.4757575Z ***
2025-05-07T20:23:41.4757776Z LOGNAME=ec2-user
2025-05-07T20:23:41.4758010Z GITHUB_REPOSITORY_ID=150154628
2025-05-07T20:23:41.4758279Z ENFORCE_CUDA_DEVICE=1
2025-05-07T20:23:41.4758516Z GITHUB_ACTIONS=true
2025-05-07T20:23:41.4758741Z SYSTEMD_EXEC_PID=55419
2025-05-07T20:23:41.4759029Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
2025-05-07T20:23:41.4759591Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge
2025-05-07T20:23:41.4760233Z RUNNER_ENVIRONMENT=self-hosted
2025-05-07T20:23:41.4760526Z GITHUB_REF=refs/pull/4066/merge
2025-05-07T20:23:41.4760794Z RUNNER_OS=Linux
2025-05-07T20:23:41.4761019Z GITHUB_REF_PROTECTED=false
2025-05-07T20:23:41.4761276Z HOME=/home/ec2-user
2025-05-07T20:23:41.4761536Z GITHUB_API_URL=https://api.github.com
2025-05-07T20:23:41.4762174Z LANG=C.UTF-8
2025-05-07T20:23:41.4762471Z RUNNER_TRACKING_ID=github_d537b2d4-b72f-4240-a0b6-544aab4d7466
2025-05-07T20:23:41.4762841Z RUNNER_ARCH=X64
2025-05-07T20:23:41.4763124Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp
2025-05-07T20:23:41.4763456Z BUILD_TARGET=genai
2025-05-07T20:23:41.4763999Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4764886Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4765635Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json
2025-05-07T20:23:41.4766324Z INVOCATION_ID=1a5cf068cbd9400aa048f8bfcc0aff7d
2025-05-07T20:23:41.4766665Z GITHUB_EVENT_NAME=pull_request
2025-05-07T20:23:41.4766940Z GITHUB_RUN_ID=14891846252
2025-05-07T20:23:41.4767533Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_0aae4d04-12f7-4bb7-9b39-789ed9ac7062
2025-05-07T20:23:41.4768171Z BUILD_ENV=build_binary
2025-05-07T20:23:41.4768417Z GITHUB_ACTOR=q10
2025-05-07T20:23:41.4768636Z GITHUB_RUN_ATTEMPT=1
2025-05-07T20:23:41.4768870Z KERN_NAME_LC=linux
2025-05-07T20:23:41.4769104Z BUILD_CUDA_VERSION=12.8.0
2025-05-07T20:23:41.4769412Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql
2025-05-07T20:23:41.4769762Z PLATFORM_NAME=Linux-x86_64
2025-05-07T20:23:41.4770017Z USER=ec2-user
2025-05-07T20:23:41.4770253Z GITHUB_SERVER_URL=https://github.com
2025-05-07T20:23:41.4770540Z SHLVL=1 2025-05-07T20:23:41.4770756Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:41.4771110Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:41.4771568Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:41.4771942Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:41.4772191Z KERN_NAME=Linux 2025-05-07T20:23:41.4772424Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:41.4772853Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:41.4773298Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:41.4773578Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:41.4773834Z JOURNAL_STREAM=8:92359 2025-05-07T20:23:41.4774157Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:41.4774527Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:41.4774844Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:41.4775188Z GITHUB_BASE_REF=main 2025-05-07T20:23:41.4775405Z CI=true 2025-05-07T20:23:41.4775621Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:41.4775914Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:41.4776200Z GITHUB_ACTION_REF= 2025-05-07T20:23:41.4776451Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:41.4777084Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_0aae4d04-12f7-4bb7-9b39-789ed9ac7062 2025-05-07T20:23:41.4777688Z MACHINE_NAME=x86_64 2025-05-07T20:23:41.4777914Z _=/usr/bin/printenv 2025-05-07T20:23:41.4778061Z 2025-05-07T20:23:41.4778191Z ################################################################################ 2025-05-07T20:23:41.4778517Z [INFO] Print ldd version ... 2025-05-07T20:23:41.4778780Z + ldd --version 2025-05-07T20:23:41.4778910Z 2025-05-07T20:23:41.4778998Z ldd (GNU libc) 2.34 2025-05-07T20:23:41.4779274Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:41.4779727Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:41.4780267Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:41.4780735Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:41.4780962Z 2025-05-07T20:23:41.4781090Z ################################################################################ 2025-05-07T20:23:41.4781404Z [INFO] Print CPU info ... 
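The glibc version that ldd reports (2.34 here) is what bounds manylinux wheel compatibility for binaries built on this runner. A one-line extraction of just the version number, shown as an illustration rather than a step of this workflow:

  glibc_version=$(ldd --version | head -1 | awk '{print $NF}')
  echo "glibc ${glibc_version}"   # -> "glibc 2.34" on this runner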
2025-05-07T20:23:41.4781090Z ################################################################################
2025-05-07T20:23:41.4781404Z [INFO] Print CPU info ...
2025-05-07T20:23:41.4781650Z + nproc
2025-05-07T20:23:41.4781770Z
2025-05-07T20:23:41.4783623Z 16
2025-05-07T20:23:41.4785290Z
2025-05-07T20:23:41.4785509Z + lscpu
2025-05-07T20:23:41.4785623Z
2025-05-07T20:23:41.4858427Z Architecture:             x86_64
2025-05-07T20:23:41.4858978Z CPU op-mode(s):           32-bit, 64-bit
2025-05-07T20:23:41.4859520Z Address sizes:            48 bits physical, 48 bits virtual
2025-05-07T20:23:41.4859932Z Byte Order:               Little Endian
2025-05-07T20:23:41.4860320Z CPU(s):                   16
2025-05-07T20:23:41.4860760Z On-line CPU(s) list:      0-15
2025-05-07T20:23:41.4861223Z Vendor ID:                AuthenticAMD
2025-05-07T20:23:41.4861704Z Model name:               AMD EPYC 7R32
2025-05-07T20:23:41.4862037Z CPU family:               23
2025-05-07T20:23:41.4862566Z Model:                    49
2025-05-07T20:23:41.4862878Z Thread(s) per core:       2
2025-05-07T20:23:41.4863183Z Core(s) per socket:       8
2025-05-07T20:23:41.4863495Z Socket(s):                1
2025-05-07T20:23:41.4863793Z Stepping:                 0
2025-05-07T20:23:41.4864107Z BogoMIPS:                 5599.99
2025-05-07T20:23:41.4866298Z Flags:                    fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:41.4868482Z Hypervisor vendor:        KVM
2025-05-07T20:23:41.4868813Z Virtualization type:      full
2025-05-07T20:23:41.4869183Z L1d cache:                256 KiB (8 instances)
2025-05-07T20:23:41.4869569Z L1i cache:                256 KiB (8 instances)
2025-05-07T20:23:41.4869956Z L2 cache:                 4 MiB (8 instances)
2025-05-07T20:23:41.4870337Z L3 cache:                 32 MiB (2 instances)
2025-05-07T20:23:41.4870685Z NUMA node(s):             1
2025-05-07T20:23:41.4871028Z NUMA node0 CPU(s):        0-15
2025-05-07T20:23:41.4871403Z Vulnerability Gather data sampling:   Not affected
2025-05-07T20:23:41.4871846Z Vulnerability Itlb multihit:          Not affected
2025-05-07T20:23:41.4872393Z Vulnerability L1tf:                   Not affected
2025-05-07T20:23:41.4872924Z Vulnerability Mds:                    Not affected
2025-05-07T20:23:41.4873452Z Vulnerability Meltdown:               Not affected
2025-05-07T20:23:41.4873964Z Vulnerability Mmio stale data:        Not affected
2025-05-07T20:23:41.4874496Z Vulnerability Reg file data sampling: Not affected
2025-05-07T20:23:41.4875292Z Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
2025-05-07T20:23:41.4876132Z Vulnerability Spec rstack overflow:   Mitigation; safe RET
2025-05-07T20:23:41.4876734Z Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
2025-05-07T20:23:41.4877602Z Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
2025-05-07T20:23:41.4878493Z Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
2025-05-07T20:23:41.4879199Z Vulnerability Srbds:                  Not affected
2025-05-07T20:23:41.4879581Z Vulnerability Tsx async abort:        Not affected
2025-05-07T20:23:41.4879922Z
2025-05-07T20:23:41.4880018Z + cat /proc/cpuinfo
2025-05-07T20:23:41.4880256Z
2025-05-07T20:23:41.4880354Z processor : 0
2025-05-07T20:23:41.4880580Z vendor_id : AuthenticAMD
2025-05-07T20:23:41.4881001Z cpu family : 23
2025-05-07T20:23:41.4881221Z model : 49
2025-05-07T20:23:41.4881437Z model name : AMD EPYC 7R32
2025-05-07T20:23:41.4881699Z stepping : 0
2025-05-07T20:23:41.4881924Z microcode : 0x830107f
2025-05-07T20:23:41.4882157Z cpu MHz : 3302.129
2025-05-07T20:23:41.4882384Z cache size : 512 KB
2025-05-07T20:23:41.4882610Z physical id : 0
2025-05-07T20:23:41.4882825Z siblings : 16
2025-05-07T20:23:41.4883037Z core id : 0
2025-05-07T20:23:41.4883246Z cpu cores : 8
2025-05-07T20:23:41.4883459Z apicid : 0
2025-05-07T20:23:41.4883664Z initial apicid : 0
2025-05-07T20:23:41.4883885Z fpu : yes
2025-05-07T20:23:41.4884095Z fpu_exception : yes
2025-05-07T20:23:41.4884317Z cpuid level : 13
2025-05-07T20:23:41.4884536Z wp : yes
2025-05-07T20:23:41.4886674Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
2025-05-07T20:23:41.4889000Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret
2025-05-07T20:23:41.4889507Z bogomips : 5599.99
2025-05-07T20:23:41.4889741Z TLB size : 3072 4K pages
2025-05-07T20:23:41.4889989Z clflush size : 64
2025-05-07T20:23:41.4890216Z cache_alignment : 64
2025-05-07T20:23:41.4890499Z address sizes : 48 bits physical, 48 bits virtual
2025-05-07T20:23:41.4890835Z power management:
2025-05-07T20:23:41.4890974Z
[... /proc/cpuinfo blocks for processors 1-15 omitted: they repeat the processor 0 block verbatim except for the processor, core id, apicid, initial apicid, and cpu MHz fields ...]
2025-05-07T20:23:41.5109383Z
2025-05-07T20:23:41.5109387Z
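For sizing build and test parallelism, the useful numbers in the output above are the logical CPU count and the SMT factor. A small sketch that condenses them from the same nproc/lscpu calls -- an illustration, not a step of setup_env.bash:

  jobs=$(nproc)                                              # 16 logical CPUs on this runner
  sockets=$(lscpu | awk -F: '/^Socket\(s\)/ { gsub(/ /, "", $2); print $2 }')
  smt=$(lscpu | awk -F: '/^Thread\(s\) per core/ { gsub(/ /, "", $2); print $2 }')
  echo "parallel jobs: $((jobs > 2 ? jobs - 2 : 1)) (sockets=${sockets}, threads/core=${smt})"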
2025-05-07T20:23:41.5109520Z ################################################################################
2025-05-07T20:23:41.5109846Z [INFO] Print PCI info ...
2025-05-07T20:23:41.5110096Z + lspci -v
2025-05-07T20:23:41.5110222Z
2025-05-07T20:23:41.5110449Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.5110851Z     Subsystem: Amazon.com, Inc. Device 1237
2025-05-07T20:23:41.5111183Z     Flags: bus master, medium devsel, latency 0
2025-05-07T20:23:41.5111399Z
2025-05-07T20:23:41.5111603Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.5111998Z     Physical Slot: 1
2025-05-07T20:23:41.5112250Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5112460Z
2025-05-07T20:23:41.5112720Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.5113160Z     Physical Slot: 1
2025-05-07T20:23:41.5113747Z     Flags: bus master, fast devsel, latency 0, IRQ 9
2025-05-07T20:23:41.5113998Z
2025-05-07T20:23:41.5114283Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller])
2025-05-07T20:23:41.5114737Z     Physical Slot: 3
2025-05-07T20:23:41.5114993Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5115521Z     Memory at c1000000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:41.5115884Z     Expansion ROM at 000c0000 [disabled] [size=128K]
2025-05-07T20:23:41.5116121Z
2025-05-07T20:23:41.5116428Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:41.5116950Z     Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:41.5117245Z     Physical Slot: 4
2025-05-07T20:23:41.5117504Z     Flags: bus master, fast devsel, latency 0, IRQ 11
2025-05-07T20:23:41.5117897Z     Memory at c1808000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5118270Z     Capabilities: <access denied>
2025-05-07T20:23:41.5118557Z     Kernel driver in use: nvme
2025-05-07T20:23:41.5118731Z
2025-05-07T20:23:41.5119106Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.5119605Z     Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.5119960Z     Physical Slot: 5
2025-05-07T20:23:41.5120281Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5120659Z     Memory at c1804000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5121051Z     Memory at c1400000 (32-bit, prefetchable) [size=4M]
2025-05-07T20:23:41.5121384Z     Capabilities: <access denied>
2025-05-07T20:23:41.5121660Z     Kernel driver in use: ena
2025-05-07T20:23:41.5121909Z     Kernel modules: ena
2025-05-07T20:23:41.5122053Z
2025-05-07T20:23:41.5122227Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.5122619Z     Subsystem: NVIDIA Corporation Device 152f
2025-05-07T20:23:41.5122924Z     Physical Slot: 30
2025-05-07T20:23:41.5123185Z     Flags: bus master, fast devsel, latency 0, IRQ 10
2025-05-07T20:23:41.5123575Z     Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
2025-05-07T20:23:41.5123985Z     Memory at 1800000000 (64-bit, prefetchable) [size=32G]
2025-05-07T20:23:41.5124369Z     Memory at 1040000000 (64-bit, prefetchable) [size=32M]
2025-05-07T20:23:41.5124706Z     Capabilities: <access denied>
2025-05-07T20:23:41.5124989Z     Kernel driver in use: nvidia
2025-05-07T20:23:41.5125258Z     Kernel modules: nvidia
2025-05-07T20:23:41.5125408Z
2025-05-07T20:23:41.5125718Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express])
2025-05-07T20:23:41.5126245Z     Subsystem: Amazon.com, Inc. Device 0000
2025-05-07T20:23:41.5126547Z     Physical Slot: 31
2025-05-07T20:23:41.5126794Z     Flags: bus master, fast devsel, latency 0
2025-05-07T20:23:41.5127160Z     Memory at c1800000 (32-bit, non-prefetchable) [size=16K]
2025-05-07T20:23:41.5127556Z     Memory at c180c000 (32-bit, prefetchable) [size=8K]
2025-05-07T20:23:41.5127898Z     Capabilities: <access denied>
2025-05-07T20:23:41.5128162Z     Kernel driver in use: nvme
2025-05-07T20:23:41.5128337Z
2025-05-07T20:23:41.5128341Z
2025-05-07T20:23:41.5128459Z ################################################################################
2025-05-07T20:23:41.5128799Z [INFO] Print Linux distribution info ...
2025-05-07T20:23:41.5129095Z + uname -a
2025-05-07T20:23:41.5129226Z
2025-05-07T20:23:41.5129643Z Linux ip-10-0-8-106.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
2025-05-07T20:23:41.5130158Z
2025-05-07T20:23:41.5130241Z + uname -m
2025-05-07T20:23:41.5130359Z
2025-05-07T20:23:41.5130441Z x86_64
2025-05-07T20:23:41.5130551Z
2025-05-07T20:23:41.5130640Z + cat /proc/version
2025-05-07T20:23:41.5130784Z
2025-05-07T20:23:41.5131337Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025
2025-05-07T20:23:41.5131982Z
2025-05-07T20:23:41.5132072Z + cat /etc/os-release
2025-05-07T20:23:41.5132220Z
2025-05-07T20:23:41.5132320Z NAME="Amazon Linux"
2025-05-07T20:23:41.5132538Z VERSION="2023"
2025-05-07T20:23:41.5132748Z ID="amzn"
2025-05-07T20:23:41.5132944Z ID_LIKE="fedora"
2025-05-07T20:23:41.5133150Z VERSION_ID="2023"
2025-05-07T20:23:41.5133481Z PLATFORM_ID="platform:al2023"
2025-05-07T20:23:41.5133771Z PRETTY_NAME="Amazon Linux 2023.6.20250317"
2025-05-07T20:23:41.5134060Z ANSI_COLOR="0;33"
2025-05-07T20:23:41.5134316Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
2025-05-07T20:23:41.5134721Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
2025-05-07T20:23:41.5135173Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
2025-05-07T20:23:41.5135595Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
2025-05-07T20:23:41.5136051Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
2025-05-07T20:23:41.5136432Z VENDOR_NAME="AWS"
2025-05-07T20:23:41.5136675Z VENDOR_URL="https://aws.amazon.com/"
2025-05-07T20:23:41.5136976Z SUPPORT_END="2029-06-30"
2025-05-07T20:23:41.5137133Z
2025-05-07T20:23:41.5137345Z ################################################################################
2025-05-07T20:23:41.5137662Z # Print EC2 Instance Info
2025-05-07T20:23:41.5137909Z #
2025-05-07T20:23:41.5138130Z # [2025-05-07T20:23:41.507Z] + print_ec2_info
2025-05-07T20:23:41.5138461Z ################################################################################
2025-05-07T20:23:41.5138680Z
2025-05-07T20:23:41.5197615Z ami-id: ami-071226ecf16aa7d96
2025-05-07T20:23:41.5309944Z instance-id: i-04dd41b83603cbddd
2025-05-07T20:23:41.5427739Z instance-type: g5.4xlarge
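The ami-id, instance-id, and instance-type lines above are presumably read from the EC2 instance metadata service. A minimal IMDSv2 sketch that would produce the same three lines (the endpoint and paths are standard EC2; the exact helper behind print_ec2_info is not shown in this log):

  TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  for key in ami-id instance-id instance-type; do
    echo "${key}: $(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
      "http://169.254.169.254/latest/meta-data/${key}")"
  done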
2025-05-07T20:23:41.5469226Z ##[group]Run . $PRELUDE; print_gpu_info
2025-05-07T20:23:41.5469660Z . $PRELUDE; print_gpu_info
2025-05-07T20:23:41.5478630Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:41.5479021Z env:
2025-05-07T20:23:41.5479252Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:41.5479589Z   BUILD_ENV: build_binary
2025-05-07T20:23:41.5479855Z   BUILD_TARGET: genai
2025-05-07T20:23:41.5480096Z   BUILD_VARIANT: cuda
2025-05-07T20:23:41.5480430Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:41.5480697Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:41.5481043Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:41.5481419Z ##[endgroup]
2025-05-07T20:23:41.8822761Z ################################################################################
2025-05-07T20:23:41.8823180Z [INFO] Printing general display info ...
2025-05-07T20:23:41.8837076Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:41.9747510Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:41.9757587Z /usr/bin/sudo
2025-05-07T20:23:41.9767987Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:41.9778764Z /usr/bin/yum
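The /usr/bin/sudo, "which: no apt-get", and /usr/bin/yum lines above are a probe for the available package manager before any system installs are attempted. A sketch of that dispatch -- the function name is assumed; only the apt-get/yum fallback order is evidenced by this log:

  install_system_packages () {
    if which apt-get > /dev/null 2>&1; then
      sudo apt-get update && sudo apt-get install -y "$@"
    elif which yum > /dev/null 2>&1; then
      sudo yum install -y "$@"
    else
      echo "[CHECK] No supported package manager found" >&2
      return 1
    fi
  }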
2025-05-07T20:23:41.9780435Z [INSTALL] Updating system repositories ...
2025-05-07T20:23:41.9800446Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y
2025-05-07T20:23:42.4526689Z Last metadata expiration check: 0:00:06 ago on Wed May  7 20:23:36 2025.
2025-05-07T20:23:42.5321833Z ================================================================================
2025-05-07T20:23:42.5322188Z WARNING:
2025-05-07T20:23:42.5322475Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:42.5322722Z
2025-05-07T20:23:42.5322819Z   Available Versions:
2025-05-07T20:23:42.5322978Z
2025-05-07T20:23:42.5323079Z   Version 2023.7.20250331:
2025-05-07T20:23:42.5323397Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:42.5323664Z
2025-05-07T20:23:42.5323803Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:42.5324021Z
2025-05-07T20:23:42.5324118Z     Release notes:
2025-05-07T20:23:42.5324538Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:42.5324928Z
2025-05-07T20:23:42.5325019Z   Version 2023.7.20250414:
2025-05-07T20:23:42.5325338Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:42.5325596Z
2025-05-07T20:23:42.5325724Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:42.5325943Z
2025-05-07T20:23:42.5326029Z     Release notes:
2025-05-07T20:23:42.5326437Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:42.5327047Z
2025-05-07T20:23:42.5327147Z   Version 2023.7.20250428:
2025-05-07T20:23:42.5327466Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:42.5327725Z
2025-05-07T20:23:42.5327844Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:42.5328070Z
2025-05-07T20:23:42.5328157Z     Release notes:
2025-05-07T20:23:42.5328561Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:42.5328934Z
2025-05-07T20:23:42.5329055Z ================================================================================
2025-05-07T20:23:42.6500385Z Dependencies resolved.
2025-05-07T20:23:42.6784929Z ================================================================================
2025-05-07T20:23:42.6786197Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:42.6786820Z ================================================================================
2025-05-07T20:23:42.6787298Z Upgrading:
2025-05-07T20:23:42.6787875Z  nvidia-container-toolkit       x86_64  1.17.6-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:42.6788842Z  nvidia-container-toolkit-base  x86_64  1.17.6-1  nvidia-container-toolkit 5.7 M
2025-05-07T20:23:42.6789434Z
2025-05-07T20:23:42.6790027Z Transaction Summary
2025-05-07T20:23:42.6790463Z ================================================================================
2025-05-07T20:23:42.6790965Z Upgrade  2 Packages
2025-05-07T20:23:42.6791192Z
2025-05-07T20:23:42.6791368Z Total download size: 6.9 M
2025-05-07T20:23:42.6791752Z Downloading Packages:
2025-05-07T20:23:42.7184939Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64  32 MB/s | 1.2 MB  00:00
2025-05-07T20:23:42.7622861Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x  69 MB/s | 5.7 MB  00:00
2025-05-07T20:23:42.7633347Z --------------------------------------------------------------------------------
2025-05-07T20:23:42.7634171Z Total                                            82 MB/s | 6.9 MB  00:00
2025-05-07T20:23:42.7636738Z Running transaction check
2025-05-07T20:23:42.7740665Z Transaction check succeeded.
2025-05-07T20:23:42.7741105Z Running transaction test
2025-05-07T20:23:42.8034383Z Transaction test succeeded.
2025-05-07T20:23:42.8037144Z Running transaction
2025-05-07T20:23:43.3652786Z   Preparing        :                                                       1/1
2025-05-07T20:23:43.4710223Z   Upgrading        : nvidia-container-toolkit-base-1.17.6-1.x86_64         1/4
2025-05-07T20:23:43.4730248Z   Upgrading        : nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:43.4937447Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              2/4
2025-05-07T20:23:43.4938221Z   Cleanup          : nvidia-container-toolkit-1.16.2-1.x86_64              3/4
2025-05-07T20:23:43.5043703Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              3/4
2025-05-07T20:23:43.5064903Z   Cleanup          : nvidia-container-toolkit-base-1.16.2-1.x86_64         4/4
2025-05-07T20:23:43.6524450Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64              4/4
2025-05-07T20:23:43.6525050Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64              1/4
2025-05-07T20:23:43.6525623Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64              2/4
2025-05-07T20:23:43.6526170Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64         3/4
2025-05-07T20:23:43.8009343Z ================================================================================
2025-05-07T20:23:43.8009727Z WARNING:
2025-05-07T20:23:43.8009982Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:43.8010229Z
2025-05-07T20:23:43.8010326Z   Available Versions:
2025-05-07T20:23:43.8010486Z
2025-05-07T20:23:43.8010579Z   Version 2023.7.20250331:
2025-05-07T20:23:43.8010905Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:43.8011431Z
2025-05-07T20:23:43.8011559Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:43.8011786Z
2025-05-07T20:23:43.8011876Z     Release notes:
2025-05-07T20:23:43.8012301Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:43.8012683Z
2025-05-07T20:23:43.8012796Z   Version 2023.7.20250414:
2025-05-07T20:23:43.8013116Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:43.8013643Z
2025-05-07T20:23:43.8013767Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:43.8013985Z
2025-05-07T20:23:43.8014081Z     Release notes:
2025-05-07T20:23:43.8014488Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:43.8014872Z
2025-05-07T20:23:43.8014966Z   Version 2023.7.20250428:
2025-05-07T20:23:43.8015290Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:43.8015549Z
2025-05-07T20:23:43.8015681Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:43.8015897Z
2025-05-07T20:23:43.8015987Z     Release notes:
2025-05-07T20:23:43.8016394Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:43.8016769Z
2025-05-07T20:23:43.8017113Z ================================================================================
2025-05-07T20:23:43.8578162Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64         4/4
2025-05-07T20:23:43.8578848Z
2025-05-07T20:23:43.8579019Z Upgraded:
2025-05-07T20:23:43.8579729Z   nvidia-container-toolkit-1.17.6-1.x86_64
2025-05-07T20:23:43.8580894Z   nvidia-container-toolkit-base-1.17.6-1.x86_64
2025-05-07T20:23:43.8581524Z
2025-05-07T20:23:43.8581628Z Complete!
2025-05-07T20:23:43.9016676Z [INSTALL] Installing system package(s): hostname lshw ...
2025-05-07T20:23:43.9038918Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw
2025-05-07T20:23:44.3542686Z Last metadata expiration check: 0:00:08 ago on Wed May  7 20:23:36 2025.
2025-05-07T20:23:44.3780912Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:44.3786800Z Package lshw-B.02.19.2-7.amzn2023.0.3.x86_64 is already installed.
2025-05-07T20:23:44.4191267Z Dependencies resolved.
2025-05-07T20:23:44.4373799Z Nothing to do.
2025-05-07T20:23:44.4374238Z Complete!
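Note that yum install -y succeeds without doing anything when the requested packages are already present ("Nothing to do. / Complete!"), so the install step above is naturally idempotent. An explicit pre-check would look like this -- an alternative sketch, not what this job actually runs:

  for pkg in hostname lshw; do
    # rpm -q exits non-zero when the package is absent
    rpm -q "${pkg}" > /dev/null 2>&1 || sudo yum install -y "${pkg}"
  done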
2025-05-07T20:23:44.4774365Z + hostname
2025-05-07T20:23:44.4774500Z
2025-05-07T20:23:44.4788899Z ip-10-0-8-106.ec2.internal
2025-05-07T20:23:44.4790556Z
2025-05-07T20:23:44.4790797Z + sudo lshw -C display
2025-05-07T20:23:44.4790969Z
2025-05-07T20:23:44.7280479Z   *-display:0 UNCLAIMED
2025-05-07T20:23:44.7280787Z        description: VGA compatible controller
2025-05-07T20:23:44.7281119Z        product: Amazon.com, Inc.
2025-05-07T20:23:44.7281409Z        vendor: Amazon.com, Inc.
2025-05-07T20:23:44.7281681Z        physical id: 3
2025-05-07T20:23:44.7282059Z        bus info: pci@0000:00:03.0
2025-05-07T20:23:44.7282621Z        version: 00
2025-05-07T20:23:44.7283074Z        width: 32 bits
2025-05-07T20:23:44.7283517Z        clock: 33MHz
2025-05-07T20:23:44.7284013Z        capabilities: vga_controller bus_master
2025-05-07T20:23:44.7284656Z        configuration: latency=0
2025-05-07T20:23:44.7285337Z        resources: memory:c1000000-c13fffff memory:c0000-dffff
2025-05-07T20:23:44.7286010Z   *-display:1
2025-05-07T20:23:44.7286460Z        description: 3D controller
2025-05-07T20:23:44.7287023Z        product: GA102GL [A10G]
2025-05-07T20:23:44.7287558Z        vendor: NVIDIA Corporation
2025-05-07T20:23:44.7288102Z        physical id: 1e
2025-05-07T20:23:44.7288582Z        bus info: pci@0000:00:1e.0
2025-05-07T20:23:44.7289098Z        version: a1
2025-05-07T20:23:44.7289517Z        width: 64 bits
2025-05-07T20:23:44.7289961Z        clock: 33MHz
2025-05-07T20:23:44.7290547Z        capabilities: pm pciexpress msix bus_master cap_list
2025-05-07T20:23:44.7291300Z        configuration: driver=nvidia latency=0
2025-05-07T20:23:44.7292271Z        resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff
2025-05-07T20:23:44.7320925Z
2025-05-07T20:23:44.7321290Z ################################################################################
2025-05-07T20:23:44.7321816Z [INFO] Printing NVIDIA GPU info ...
2025-05-07T20:23:44.7450337Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:44.7636600Z Wed May  7 20:23:44 2025
2025-05-07T20:23:44.7636982Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.7637485Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:44.7637995Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:44.7638504Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:44.7639076Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:44.7639514Z |                                         |                        |               MIG M. |
2025-05-07T20:23:44.7640281Z |=========================================+========================+======================|
2025-05-07T20:23:44.7770093Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:44.7770559Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:23:44.7770955Z |                                         |                        |                  N/A |
2025-05-07T20:23:44.7771375Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:44.7774791Z
2025-05-07T20:23:44.7775204Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:44.7775653Z | Processes:                                                                              |
2025-05-07T20:23:44.7776113Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:44.7776555Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:44.7776919Z |=========================================================================================|
2025-05-07T20:23:44.7780032Z |  No running processes found                                                             |
2025-05-07T20:23:44.7780520Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:45.0406990Z ################################################################################
2025-05-07T20:23:45.0407481Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:45.0547830Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:45.0548732Z [CHECK] rocminfo not found
2025-05-07T20:23:45.0558009Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:45.0558860Z [CHECK] rocm-smi not found
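ENFORCE_CUDA_DEVICE=1 is set in this job's environment, so the prelude is expected to fail fast if no usable NVIDIA device is visible after the checks above; the missing rocminfo/rocm-smi just confirms this runner is not a ROCm host. A hedged sketch of such a gate (the variable is real, the gating logic shown here is an assumption):

  if [ "${ENFORCE_CUDA_DEVICE:-0}" = "1" ]; then
    # nvidia-smi exits non-zero if the driver or device is unavailable
    if ! nvidia-smi --list-gpus > /dev/null 2>&1; then
      echo "[CHECK] ENFORCE_CUDA_DEVICE is set, but no NVIDIA GPU is usable" >&2
      exit 1
    fi
  fi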
2025-05-07T20:24:00.7085612Z installation finished.
2025-05-07T20:24:00.7093668Z + rm -f miniconda.sh
2025-05-07T20:24:00.7411845Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:00.7412209Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:01.1156219Z no change     /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:01.1156609Z no change     /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:01.1156993Z no change     /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:01.1157386Z no change     /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:01.1157763Z no change     /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:01.1158437Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:01.1158896Z no change     /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:01.1159358Z no change     /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:01.1159833Z no change     /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:01.1160490Z no change     /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:01.1161032Z no change     /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:01.1161413Z no change     /home/ec2-user/.bashrc
2025-05-07T20:24:01.1161695Z No action taken.
2025-05-07T20:24:01.1811145Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:02.0229257Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:02.0252824Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:15.5434259Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:17.1348581Z Solving environment: done
2025-05-07T20:24:17.2317819Z ## Package Plan ##
2025-05-07T20:24:17.2318167Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:17.2318523Z   added / updated specs:
2025-05-07T20:24:17.2318791Z     - conda-libmamba-solver
2025-05-07T20:24:17.2319066Z     - libarchive
2025-05-07T20:24:17.2319287Z     - libmamba
2025-05-07T20:24:17.2319502Z     - libmambapy
2025-05-07T20:24:17.2319764Z The following packages will be downloaded:
2025-05-07T20:24:17.2320112Z     package                      |            build
2025-05-07T20:24:17.2320536Z     -----------------------------|-----------------
2025-05-07T20:24:17.2320957Z     ca-certificates-2025.4.26    |       hbd8a1cb_0         149 KB  conda-forge
2025-05-07T20:24:17.2321486Z     certifi-2025.4.26            |     pyhd8ed1ab_0         154 KB  conda-forge
2025-05-07T20:24:17.2321966Z     conda-25.3.1                 |  py313h78bf25f_1         1.1 MB  conda-forge
2025-05-07T20:24:17.2322463Z     conda-libmamba-solver-25.4.0 |     pyhd8ed1ab_0          41 KB  conda-forge
2025-05-07T20:24:17.2322933Z     ------------------------------------------------------------
2025-05-07T20:24:17.2323294Z                                            Total:         1.4 MB
2025-05-07T20:24:17.2323638Z The following packages will be UPDATED:
2025-05-07T20:24:17.2328309Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:17.2329108Z   conda              pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:17.2329731Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:17.2330385Z   certifi            pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:17.2331216Z   conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:17.2332130Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:17.3015094Z certifi-2025.4.26    | 154 KB  | ########## | 100%
2025-05-07T20:24:17.3096210Z conda-libmamba-solve | 41 KB   | ########## | 100%
2025-05-07T20:24:17.3274459Z ca-certificates-2025 | 149 KB  | ########## | 100%
2025-05-07T20:24:17.4416732Z conda-25.3.1         | 1.1 MB  | ########## | 100%
2025-05-07T20:24:17.4418411Z done
2025-05-07T20:24:17.5420830Z Preparing transaction: done
2025-05-07T20:24:17.6426194Z Verifying transaction: done
2025-05-07T20:24:18.9445029Z Executing transaction: done
2025-05-07T20:24:20.7943950Z [SETUP] Updating Miniconda base packages ...
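The [EXEC] [ATTEMPT 0/3] prefix that precedes each command in this log comes from a retry wrapper in .github/scripts/setup_env.bash that re-runs flaky, network-bound commands. A minimal bash sketch of such a wrapper (the helper name, retry count, and back-off are illustrative, not the actual implementation):

    # Illustrative retry wrapper; the real helper lives in setup_env.bash.
    exec_with_retries () {
      local max_retries=3
      local attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0   # command succeeded; stop retrying
        fi
        sleep 2      # brief pause before the next attempt
      done
      echo "[EXEC] Command failed after ${max_retries} retries: $*"
      return 1
    }
    # example: the network probe used throughout this log
    exec_with_retries wget -q --timeout 1 pypi.org -O /dev/null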
2025-05-07T20:24:20.7967503Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:21.6219599Z Channels:
2025-05-07T20:24:21.6219858Z  - defaults
2025-05-07T20:24:21.6220079Z Platform: linux-64
2025-05-07T20:24:22.8596027Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:22.9810308Z Solving environment: done
2025-05-07T20:24:22.9810913Z Channels:
2025-05-07T20:24:22.9811339Z  - defaults
2025-05-07T20:24:22.9811339Z Platform: linux-64
2025-05-07T20:24:23.2772049Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:23.4929011Z Solving environment: done
2025-05-07T20:24:23.6392754Z ## Package Plan ##
2025-05-07T20:24:23.6393071Z   environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.6393491Z   added / updated specs:
2025-05-07T20:24:23.6393744Z     - conda
2025-05-07T20:24:23.6393993Z The following packages will be downloaded:
2025-05-07T20:24:23.6394357Z     package        |            build
2025-05-07T20:24:23.6394689Z     ---------------|-----------------
2025-05-07T20:24:23.6395281Z     pip-25.1       |     pyhc872135_2         1.3 MB
2025-05-07T20:24:23.6395690Z     tzdata-2025b   |       h04d1e81_0         116 KB
2025-05-07T20:24:23.6396076Z     ------------------------------------------------------------
2025-05-07T20:24:23.6396427Z                                            Total:         1.4 MB
2025-05-07T20:24:23.6396760Z The following packages will be UPDATED:
2025-05-07T20:24:23.6397284Z   pip     pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:23.6397807Z   tzdata                      2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:23.6398372Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:23.7029659Z pip-25.1             | 1.3 MB  | ########## | 100%
2025-05-07T20:24:23.8675708Z tzdata-2025b         | 116 KB  | ########## | 100%
2025-05-07T20:24:23.9004903Z done
2025-05-07T20:24:24.0007404Z Preparing transaction: done
2025-05-07T20:24:24.1013750Z Verifying transaction: done
2025-05-07T20:24:26.2040326Z Executing transaction: done
2025-05-07T20:24:26.8403397Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:26.8408792Z + conda clean --packages --tarball -y
2025-05-07T20:24:27.8609853Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:27.8610217Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:27.9251666Z + conda clean --all -y
2025-05-07T20:24:28.4754540Z There are no unused tarball(s) to remove.
2025-05-07T20:24:28.4754906Z Will remove 1 index cache(s).
2025-05-07T20:24:28.4755205Z There are no unused package(s) to remove.
2025-05-07T20:24:28.4755525Z There are no tempfile(s) to remove.
2025-05-07T20:24:28.4755863Z There are no logfile(s) to remove.
2025-05-07T20:24:28.5398615Z + conda info
2025-05-07T20:24:29.2957784Z      active environment : base
2025-05-07T20:24:29.2958316Z     active env location : /home/ec2-user/miniconda
2025-05-07T20:24:29.2958810Z             shell level : 1
2025-05-07T20:24:29.2959231Z        user config file : /home/ec2-user/.condarc
2025-05-07T20:24:29.2959803Z  populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:29.2960468Z           conda version : 25.3.1
2025-05-07T20:24:29.2960891Z     conda-build version : not installed
2025-05-07T20:24:29.2961346Z          python version : 3.13.2.final.0
2025-05-07T20:24:29.2961783Z                  solver : libmamba (default)
2025-05-07T20:24:29.2962272Z        virtual packages : __archspec=1=zen2
2025-05-07T20:24:29.2962737Z                           __conda=25.3.1=0
2025-05-07T20:24:29.2963165Z                           __cuda=12.8=0
2025-05-07T20:24:29.2963590Z                           __glibc=2.34=0
2025-05-07T20:24:29.2964018Z                           __linux=6.1.130=0
2025-05-07T20:24:29.2964450Z                           __unix=0=0
2025-05-07T20:24:29.2965348Z        base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:29.2965979Z       conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:29.2966502Z   conda av metadata url : None
2025-05-07T20:24:29.2967056Z            channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:29.2967706Z                           https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:29.2968290Z                           https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:29.2968850Z                           https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:29.2969413Z           package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:29.2969923Z                           /home/ec2-user/.conda/pkgs
2025-05-07T20:24:29.2970672Z        envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:29.2971189Z                           /home/ec2-user/.conda/envs
2025-05-07T20:24:29.2971677Z                platform : linux-64
2025-05-07T20:24:29.2972928Z              user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:29.2974188Z                 UID:GID : 1000:1000
2025-05-07T20:24:29.2974610Z              netrc file : None
2025-05-07T20:24:29.2975001Z            offline mode : False
2025-05-07T20:24:29.3629953Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:29.3630725Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_0d771e95-4678-43f1-82ee-37ea75e113eb ...
2025-05-07T20:24:29.3631572Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
2025-05-07T20:24:29.3703770Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:29.3704287Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:29.3722388Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:29.3722749Z env:
2025-05-07T20:24:29.3722979Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:29.3723293Z   BUILD_ENV: build_binary
2025-05-07T20:24:29.3723549Z   BUILD_TARGET: genai
2025-05-07T20:24:29.3723785Z   BUILD_VARIANT: cuda
2025-05-07T20:24:29.3724023Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:29.3724286Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:29.3724596Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:29.3724933Z ##[endgroup]
2025-05-07T20:24:29.7066087Z ################################################################################
2025-05-07T20:24:29.7066817Z # Create Conda Environment
2025-05-07T20:24:29.7067353Z #
2025-05-07T20:24:29.7080674Z # [2025-05-07T20:24:29.707Z] + create_conda_environment build_binary 3.13
2025-05-07T20:24:29.7081115Z ################################################################################
2025-05-07T20:24:29.7095540Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:29.7985293Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:29.7985681Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:29.7986017Z + conda info --envs
2025-05-07T20:24:30.5451315Z # conda environments:
2025-05-07T20:24:30.5451649Z #
2025-05-07T20:24:30.5451946Z base                 /home/ec2-user/miniconda
2025-05-07T20:24:30.6112857Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:32.2509697Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:32.2534747Z [SETUP] Creating new Conda environment (Python 3.13) ...
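The create_conda_environment step below boils down to deleting any stale prefix and recreating the environment pinned to the requested Python. A condensed sketch of the equivalent commands (grounded in the + lines printed in this log; the real function in setup_env.bash does additional validation):

    env_name=build_binary     # $BUILD_ENV for this job
    python_version=3.13
    # remove any stale prefix so the create starts from a clean slate
    rm -rf "$HOME/miniconda/envs/${env_name}"
    conda create -y -n "${env_name}" "python=${python_version}"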
2025-05-07T20:24:32.2557145Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:33.0099177Z Channels:
2025-05-07T20:24:33.0099435Z  - defaults
2025-05-07T20:24:33.0099651Z Platform: linux-64
2025-05-07T20:24:34.5753206Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:34.6760022Z Solving environment: done
2025-05-07T20:24:34.7050709Z ## Package Plan ##
2025-05-07T20:24:34.7051457Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:34.7052344Z   added / updated specs:
2025-05-07T20:24:34.7052855Z     - python=3.13
2025-05-07T20:24:34.7053405Z The following packages will be downloaded:
2025-05-07T20:24:34.7054490Z     package                   |            build
2025-05-07T20:24:34.7055159Z     --------------------------|-----------------
2025-05-07T20:24:34.7055923Z     _libgcc_mutex-0.1         |             main           3 KB
2025-05-07T20:24:34.7056488Z     _openmp_mutex-5.1         |            1_gnu          21 KB
2025-05-07T20:24:34.7056940Z     ca-certificates-2025.2.25 |       h06a4308_0         129 KB
2025-05-07T20:24:34.7057370Z     python_abi-3.13           |          0_cp313           6 KB
2025-05-07T20:24:34.7057767Z     ------------------------------------------------------------
2025-05-07T20:24:34.7058127Z                                            Total:         159 KB
2025-05-07T20:24:34.7058485Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:34.7058937Z   _libgcc_mutex      pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:34.7059413Z   _openmp_mutex      pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:34.7060067Z   bzip2              pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:34.7060578Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:34.7061095Z   expat              pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:34.7061578Z   ld_impl_linux-64   pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:34.7062071Z   libffi             pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:34.7062523Z   libgcc-ng          pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:34.7062992Z   libgomp            pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:34.7063458Z   libmpdec           pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:34.7063955Z   libstdcxx-ng       pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:34.7064435Z   libuuid            pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:34.7064897Z   ncurses            pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:34.7065346Z   openssl            pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:34.7065777Z   pip                pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:34.7066227Z   python             pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:34.7066705Z   python_abi         pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:34.7067162Z   readline           pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:34.7067658Z   setuptools         pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:34.7068154Z   sqlite             pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:34.7068569Z   tk                 pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:34.7068980Z   tzdata             pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:34.7069420Z   wheel              pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:34.7069856Z   xz                 pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:34.7070255Z   zlib               pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:34.7070676Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:34.7413193Z _openmp_mutex-5.1    | 21 KB   | ########## | 100%
2025-05-07T20:24:34.7455183Z _libgcc_mutex-0.1    | 3 KB    | ########## | 100%
2025-05-07T20:24:34.7560088Z python_abi-3.13      | 6 KB    | ########## | 100%
2025-05-07T20:24:34.7739795Z ca-certificates-2025 | 129 KB  | ########## | 100%
2025-05-07T20:24:34.7767450Z done
2025-05-07T20:24:34.9822079Z Preparing transaction: done
2025-05-07T20:24:36.4381425Z Verifying transaction: done
2025-05-07T20:24:38.7538661Z Executing transaction: done
2025-05-07T20:24:38.8037946Z #
2025-05-07T20:24:38.8038567Z # To activate this environment, use
2025-05-07T20:24:38.8039149Z #
2025-05-07T20:24:38.8039559Z #     $ conda activate build_binary
2025-05-07T20:24:38.8040092Z #
2025-05-07T20:24:38.8040824Z # To deactivate an active environment, use
2025-05-07T20:24:38.8041440Z #
2025-05-07T20:24:38.8041829Z #     $ conda deactivate
2025-05-07T20:24:38.9139629Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:38.9163411Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:41.7915220Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:41.7915872Z Collecting pip
2025-05-07T20:24:41.7916203Z   Using cached pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:41.7916638Z Using cached pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:41.7917002Z Installing collected packages: pip
2025-05-07T20:24:41.7917313Z   Attempting uninstall: pip
2025-05-07T20:24:41.7917608Z     Found existing installation: pip 25.1
2025-05-07T20:24:41.7917928Z     Uninstalling pip-25.1:
2025-05-07T20:24:41.7918219Z       Successfully uninstalled pip-25.1
2025-05-07T20:24:41.7918575Z Successfully installed pip-25.1.1
2025-05-07T20:24:41.8547316Z [SETUP] Upgrading pyOpenSSL ...
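Two details are worth noting here. The pip upgrade above uses conda run -n build_binary, which executes a command inside the named environment without activating it in the current shell. The pyOpenSSL install that follows carries a version constraint; in an interactive shell the spec must be quoted so that > is not parsed as an output redirection (a hedged reconstruction, not the exact invocation in setup_env.bash):

    # run a tool inside the environment without activating it
    conda run -n build_binary pip install --upgrade pip
    # version-constrained install restricted to conda-forge; note the quoting
    conda install -n build_binary -c conda-forge --override-channels -y "pyOpenSSL>22.1.0"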
2025-05-07T20:24:41.8571473Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:42.7105253Z Channels:
2025-05-07T20:24:42.7105588Z  - conda-forge
2025-05-07T20:24:42.7105911Z Platform: linux-64
2025-05-07T20:24:53.2558856Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:54.9437531Z Solving environment: done
2025-05-07T20:24:55.0072390Z ## Package Plan ##
2025-05-07T20:24:55.0072908Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:55.0073333Z   added / updated specs:
2025-05-07T20:24:55.0073618Z     - pyopenssl[version='>22.1.0']
2025-05-07T20:24:55.0074520Z The following packages will be downloaded:
2025-05-07T20:24:55.0074966Z     package                  |            build
2025-05-07T20:24:55.0075312Z     -------------------------|-----------------
2025-05-07T20:24:55.0075701Z     cffi-1.17.1              |  py313hfab6e84_0         289 KB  conda-forge
2025-05-07T20:24:55.0076164Z     cryptography-44.0.3      |  py313h6556f6e_0         1.5 MB  conda-forge
2025-05-07T20:24:55.0076630Z     libgcc-15.1.0            |       h767d61c_2         810 KB  conda-forge
2025-05-07T20:24:55.0077068Z     libgcc-ng-15.1.0         |       h69a702a_2          34 KB  conda-forge
2025-05-07T20:24:55.0077511Z     libgomp-15.1.0           |       h767d61c_2         442 KB  conda-forge
2025-05-07T20:24:55.0077938Z     openssl-3.5.0            |       h7b32b05_1         3.0 MB  conda-forge
2025-05-07T20:24:55.0078383Z     pycparser-2.22           |     pyh29332c3_1         108 KB  conda-forge
2025-05-07T20:24:55.0079071Z     pyopenssl-25.0.0         |     pyhd8ed1ab_0         120 KB  conda-forge
2025-05-07T20:24:55.0079717Z     typing-extensions-4.13.2 |       h0e9735f_0          88 KB  conda-forge
2025-05-07T20:24:55.0080349Z     typing_extensions-4.13.2 |     pyh29332c3_0          51 KB  conda-forge
2025-05-07T20:24:55.0080793Z     ------------------------------------------------------------
2025-05-07T20:24:55.0081149Z                                            Total:         6.4 MB
2025-05-07T20:24:55.0081502Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:55.0081948Z   cffi               conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:24:55.0082512Z   cryptography       conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:24:55.0083033Z   libgcc             conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:24:55.0083744Z   pycparser          conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:24:55.0084437Z   pyopenssl          conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:24:55.0085231Z   typing-extensions  conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:24:55.0086150Z   typing_extensions  conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:24:55.0086879Z The following packages will be UPDATED:
2025-05-07T20:24:55.0087834Z   ca-certificates    pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:55.0089069Z   libgcc-ng          pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:24:55.0089921Z   libgomp            pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:24:55.0090589Z   openssl            pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:24:55.0091154Z Downloading and Extracting Packages: ...working...
2025-05-07T20:24:55.1073983Z cffi-1.17.1          | 289 KB  | ########## | 100%
2025-05-07T20:24:55.1251100Z pyopenssl-25.0.0     | 120 KB  | ########## | 100%
2025-05-07T20:24:55.1615622Z libgcc-15.1.0        | 810 KB  | ########## | 100%
2025-05-07T20:24:55.1717461Z libgomp-15.1.0       | 442 KB  | ########## | 100%
2025-05-07T20:24:55.1864208Z pycparser-2.22       | 108 KB  | ########## | 100%
2025-05-07T20:24:55.1992325Z cryptography-44.0.3  | 1.5 MB  | ########## | 100%
2025-05-07T20:24:55.2035816Z openssl-3.5.0        | 3.0 MB  | ########## | 100%
2025-05-07T20:24:55.2144686Z typing-extensions-4. | 88 KB   | ########## | 100%
2025-05-07T20:24:55.2194635Z typing_extensions-4. | 51 KB   | ########## | 100%
2025-05-07T20:24:55.2247405Z libgcc-ng-15.1.0     | 34 KB   | ########## | 100%
2025-05-07T20:24:55.4710344Z done
2025-05-07T20:24:55.5714115Z Preparing transaction: done
2025-05-07T20:24:55.6719681Z Verifying transaction: done
2025-05-07T20:24:57.1744943Z Executing transaction: done
2025-05-07T20:24:57.3509498Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:24:59.0707323Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:24:59.0723346Z [SETUP] Installing libxcrypt ...
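The libxcrypt step below presumably works around newer glibc and Python builds no longer shipping crypt.h: the header is installed from conda-forge and then copied into the environment's Python include directory so that native extensions which still #include <crypt.h> can compile. In command form (paths taken from the cp line in this log; the python3.13 include path is specific to this job):

    prefix="$HOME/miniconda/envs/build_binary"
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    # expose crypt.h on the Python include path used by extension builds
    cp "${prefix}/include/crypt.h" "${prefix}/include/python3.13/crypt.h"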
2025-05-07T20:24:59.0746518Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:24:59.9350800Z Channels:
2025-05-07T20:24:59.9351279Z  - conda-forge
2025-05-07T20:24:59.9351746Z Platform: linux-64
2025-05-07T20:25:03.1965649Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:03.5618640Z Solving environment: done
2025-05-07T20:25:03.6241763Z ## Package Plan ##
2025-05-07T20:25:03.6242483Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:03.6243811Z   added / updated specs:
2025-05-07T20:25:03.6244319Z     - libxcrypt
2025-05-07T20:25:03.6244729Z The following packages will be downloaded:
2025-05-07T20:25:03.6245126Z     package            |            build
2025-05-07T20:25:03.6245457Z     -------------------|-----------------
2025-05-07T20:25:03.6245859Z     libxcrypt-4.4.36   |       hd590300_1          98 KB  conda-forge
2025-05-07T20:25:03.6246284Z     ------------------------------------------------------------
2025-05-07T20:25:03.6246642Z                                            Total:          98 KB
2025-05-07T20:25:03.6246996Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:03.6247466Z   libxcrypt          conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:03.6247948Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:03.8269081Z libxcrypt-4.4.36     | 98 KB   | ########## | 100%
2025-05-07T20:25:03.8272307Z done
2025-05-07T20:25:03.9277062Z Preparing transaction: done
2025-05-07T20:25:04.0281444Z Verifying transaction: done
2025-05-07T20:25:04.1287204Z Executing transaction: done
2025-05-07T20:25:07.5591715Z [SETUP] Copying over ...
2025-05-07T20:25:07.5592462Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:09.2007742Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:09.2008212Z [SETUP] Successfully created Conda environment: build_binary
2025-05-07T20:25:09.2044575Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:09.2045047Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:09.2057565Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:09.2057935Z env:
2025-05-07T20:25:09.2058172Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:09.2058482Z   BUILD_ENV: build_binary
2025-05-07T20:25:09.2058739Z   BUILD_TARGET: genai
2025-05-07T20:25:09.2058980Z   BUILD_VARIANT: cuda
2025-05-07T20:25:09.2059225Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:09.2059498Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:09.2059815Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:09.2060168Z ##[endgroup]
2025-05-07T20:25:09.5431703Z ################################################################################
2025-05-07T20:25:09.5432431Z # Install C/C++ Compilers
2025-05-07T20:25:09.5432684Z #
2025-05-07T20:25:09.5447234Z # [2025-05-07T20:25:09.544Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:09.5447887Z ################################################################################
2025-05-07T20:25:09.5462734Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:09.6357531Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:09.6368447Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:09.6390321Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:10.5008802Z Channels:
2025-05-07T20:25:10.5009092Z  - conda-forge
2025-05-07T20:25:10.5009335Z Platform: linux-64
2025-05-07T20:25:13.7872784Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:14.1532904Z Solving environment: done
2025-05-07T20:25:14.2157447Z ## Package Plan ##
2025-05-07T20:25:14.2157982Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:14.2158461Z   added / updated specs:
2025-05-07T20:25:14.2158878Z     - sysroot_linux-64=2.17
2025-05-07T20:25:14.2159346Z The following packages will be downloaded:
2025-05-07T20:25:14.2159896Z     package                        |            build
2025-05-07T20:25:14.2160501Z     -------------------------------|-----------------
2025-05-07T20:25:14.2161078Z     kernel-headers_linux-64-3.10.0 |      he073ed8_18         921 KB  conda-forge
2025-05-07T20:25:14.2161589Z     sysroot_linux-64-2.17          |      h0157908_18        14.5 MB  conda-forge
2025-05-07T20:25:14.2162027Z     ------------------------------------------------------------
2025-05-07T20:25:14.2162402Z                                            Total:        15.4 MB
2025-05-07T20:25:14.2162759Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:14.2163309Z   kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:14.2163899Z   sysroot_linux-64   conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
2025-05-07T20:25:14.2164389Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:14.5514189Z kernel-headers_linux | 921 KB  | ########## | 100%
2025-05-07T20:25:14.6660335Z sysroot_linux-64-2.1 | 14.5 MB | ########## | 100%
2025-05-07T20:25:15.2408540Z done
2025-05-07T20:25:15.3411655Z Preparing transaction: done
2025-05-07T20:25:15.5418536Z Verifying transaction: done
2025-05-07T20:25:15.7501242Z Executing transaction: done
2025-05-07T20:25:15.9042064Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:15.9042394Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:17.5857350Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:17.5870127Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:17.5892481Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:18.4776466Z Channels:
2025-05-07T20:25:18.4776732Z  - conda-forge
2025-05-07T20:25:18.4776969Z Platform: linux-64
2025-05-07T20:25:21.7254732Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:22.6964996Z Solving environment: done
2025-05-07T20:25:22.7607692Z ## Package Plan ##
2025-05-07T20:25:22.7608275Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:22.7608852Z   added / updated specs:
2025-05-07T20:25:22.7609197Z     - gxx_linux-64=11.4.0
2025-05-07T20:25:22.7609603Z The following packages will be downloaded:
2025-05-07T20:25:22.7609970Z     package                         |            build
2025-05-07T20:25:22.7610299Z     --------------------------------|-----------------
2025-05-07T20:25:22.7610717Z     binutils_impl_linux-64-2.40     |       ha1999f0_7         6.0 MB  conda-forge
2025-05-07T20:25:22.7611217Z     binutils_linux-64-2.40          |       hb3c18ed_4          28 KB  conda-forge
2025-05-07T20:25:22.7611695Z     gcc_impl_linux-64-11.4.0        |      h00c12a0_13        53.0 MB  conda-forge
2025-05-07T20:25:22.7612157Z     gcc_linux-64-11.4.0             |       ha077dfb_4          31 KB  conda-forge
2025-05-07T20:25:22.7612617Z     gxx_impl_linux-64-11.4.0        |      h634f3ee_13        11.2 MB  conda-forge
2025-05-07T20:25:22.7613084Z     gxx_linux-64-11.4.0             |       h35bfe5d_4          29 KB  conda-forge
2025-05-07T20:25:22.7613782Z     ld_impl_linux-64-2.40           |       hf3520f5_7         691 KB  conda-forge
2025-05-07T20:25:22.7614275Z     libgcc-devel_linux-64-11.4.0    |     h8f596e0_113         2.3 MB  conda-forge
2025-05-07T20:25:22.7614775Z     libsanitizer-11.4.0             |      h5763a12_13         3.5 MB  conda-forge
2025-05-07T20:25:22.7615229Z     libstdcxx-15.1.0                |       h8f9b012_2         3.7 MB  conda-forge
2025-05-07T20:25:22.7615722Z     libstdcxx-devel_linux-64-11.4.0 |     h8f596e0_113        11.1 MB  conda-forge
2025-05-07T20:25:22.7616218Z     libstdcxx-ng-15.1.0             |       h4852527_2          34 KB  conda-forge
2025-05-07T20:25:22.7616637Z     ------------------------------------------------------------
2025-05-07T20:25:22.7616990Z                                            Total:        91.6 MB
2025-05-07T20:25:22.7617350Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:22.7617873Z   binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:22.7618886Z   binutils_linux-64  conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:22.7619447Z   gcc_impl_linux-64  conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:22.7619978Z   gcc_linux-64       conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:22.7620510Z   gxx_impl_linux-64  conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:22.7621036Z   gxx_linux-64       conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:22.7621583Z   libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:22.7622166Z   libsanitizer       conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:22.7622686Z   libstdcxx          conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:22.7623409Z   libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:22.7623904Z The following packages will be UPDATED:
2025-05-07T20:25:22.7624455Z   ld_impl_linux-64   pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:22.7625202Z   libstdcxx-ng       pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:22.7625788Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:23.1056771Z binutils_impl_linux- | 6.0 MB  | ########## | 100%
2025-05-07T20:25:23.3069968Z libgcc-devel_linux-6 | 2.3 MB  | ########## | 100%
2025-05-07T20:25:23.4361335Z libstdcxx-ng-15.1.0  | 34 KB   | ########## | 100%
2025-05-07T20:25:23.4628988Z ld_impl_linux-64-2.4 | 691 KB  | ########## | 100%
2025-05-07T20:25:23.4873070Z gcc_linux-64-11.4.0  | 31 KB   | ########## | 100%
2025-05-07T20:25:23.5242267Z gxx_linux-64-11.4.0  | 29 KB   | ########## | 100%
2025-05-07T20:25:23.5592977Z binutils_linux-64-2. | 28 KB   | ########## | 100%
2025-05-07T20:25:23.6631618Z libstdcxx-15.1.0     | 3.7 MB  | ########## | 100%
2025-05-07T20:25:23.8400828Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:24.0356977Z libsanitizer-11.4.0  | 3.5 MB  | ########## | 100%
2025-05-07T20:25:24.4049656Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:25.2124497Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100%
2025-05-07T20:25:25.2136192Z 2025-05-07T20:25:25.2136197Z 2025-05-07T20:25:25.2136202Z 2025-05-07T20:25:25.2136207Z 2025-05-07T20:25:25.2136213Z 2025-05-07T20:25:25.2136218Z 2025-05-07T20:25:25.2136485Z  2025-05-07T20:25:25.2136825Z 2025-05-07T20:25:25.2136832Z 2025-05-07T20:25:25.2136838Z 2025-05-07T20:25:25.2136844Z 2025-05-07T20:25:25.2136849Z 2025-05-07T20:25:25.2136854Z 2025-05-07T20:25:25.2137080Z 2025-05-07T20:25:25.2137085Z 2025-05-07T20:25:25.2137394Z  2025-05-07T20:25:25.2137753Z 2025-05-07T20:25:25.2137771Z 2025-05-07T20:25:25.2137777Z 2025-05-07T20:25:25.2137782Z 2025-05-07T20:25:25.2137788Z 2025-05-07T20:25:25.2137793Z 2025-05-07T20:25:25.2137799Z 2025-05-07T20:25:25.2137805Z 2025-05-07T20:25:25.2137811Z 2025-05-07T20:25:25.2138142Z  2025-05-07T20:25:25.2138519Z 2025-05-07T20:25:25.2138524Z 2025-05-07T20:25:25.2138529Z 2025-05-07T20:25:25.2138534Z 2025-05-07T20:25:25.2138540Z 2025-05-07T20:25:25.2138545Z 2025-05-07T20:25:25.2138550Z 2025-05-07T20:25:25.2138555Z 2025-05-07T20:25:25.2138561Z 2025-05-07T20:25:25.2138566Z 2025-05-07T20:25:25.2138870Z  2025-05-07T20:25:25.2139282Z 2025-05-07T20:25:25.2139290Z 2025-05-07T20:25:25.2139307Z 2025-05-07T20:25:25.2139315Z 2025-05-07T20:25:25.2139323Z 2025-05-07T20:25:25.2139330Z 2025-05-07T20:25:25.2139339Z 2025-05-07T20:25:25.2139346Z 2025-05-07T20:25:25.2139362Z 2025-05-07T20:25:25.2139377Z 2025-05-07T20:25:25.2139384Z 2025-05-07T20:25:25.2139734Z  done 2025-05-07T20:25:25.3142128Z Preparing transaction: \ done 2025-05-07T20:25:25.6150676Z Verifying transaction: / - \ done 2025-05-07T20:25:25.7160774Z Executing transaction: / done 2025-05-07T20:25:25.8803379Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:29.7779111Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:29.7779683Z 2025-05-07T20:25:29.7790494Z 2025-05-07T20:25:29.7809436Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:29.7810016Z 2025-05-07T20:25:29.7822339Z 2025-05-07T20:25:29.7839537Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:29.7840092Z 2025-05-07T20:25:29.7851312Z 2025-05-07T20:25:29.7868404Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:29.7868958Z 2025-05-07T20:25:29.7880979Z 2025-05-07T20:25:31.6705477Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:31.6705777Z 2025-05-07T20:25:31.7322669Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:33.6176799Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:33.6177093Z 2025-05-07T20:25:33.6800758Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:35.5582714Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:35.5583069Z 2025-05-07T20:25:35.6204242Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:37.4966262Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:37.4966562Z 2025-05-07T20:25:37.5597541Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:37.5601846Z [INFO] Printing out all preprocessor defines in the C compiler ... 
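The four `ln -sf` commands above, together with the PATH checks that follow them, implement a small symlink-and-verify pattern: point the generic compiler names at the conda toolchain wrappers, then confirm each name resolves. A minimal sketch of the same step as a standalone script (paths are the ones printed in the log; the loop and failure handling are illustrative, not taken from setup_env.bash):

    #!/usr/bin/env bash
    # Sketch only: recreate the generic compiler names inside the conda env
    # and verify that each one resolves in PATH afterwards. Assumes the env's
    # bin directory is already on PATH, as it is in this job.
    set -euo pipefail
    prefix="/home/ec2-user/miniconda/envs/build_binary/bin"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-cc"  "${prefix}/cc"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-cc"  "${prefix}/gcc"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-c++" "${prefix}/c++"
    ln -sf "${prefix}/x86_64-conda-linux-gnu-c++" "${prefix}/g++"
    for tool in cc gcc c++ g++; do
      # command -v prints the resolved path, mirroring the [CHECK] lines above
      command -v "${tool}" || { echo "[CHECK] Binary ${tool} not found in PATH"; exit 1; }
    done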
2025-05-07T20:25:37.5601846Z [INFO] Printing out all preprocessor defines in the C compiler ...
2025-05-07T20:25:37.5602314Z + conda run -n build_binary cc -dM -E -
2025-05-07T20:25:39.4558121Z #define __GNUC__ 11
2025-05-07T20:25:39.4582266Z #define __LP64__ 1
2025-05-07T20:25:39.4588471Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:39.4600918Z #define __x86_64__ 1
2025-05-07T20:25:39.4610315Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:39.4621637Z #define __linux__ 1
2025-05-07T20:25:39.4631969Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:39.4644510Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:39.4647150Z #define __STDC__ 1
[... several hundred additional predefined macros elided ...]
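A full `-dM -E` dump like the one above runs to several hundred macros, so for spot checks it is usually grepped down to the identifying ones. A hedged one-liner in the same style as the probe commands this job runs (env name taken from the log):

    # Illustrative spot check: keep only the toolchain-identifying macros.
    conda run -n build_binary cc -dM -E - < /dev/null \
      | grep -E '__GNUC__|__GNUC_MINOR__|__GNUC_PATCHLEVEL__|__VERSION__|__STDC_VERSION__'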
2025-05-07T20:25:39.5153554Z [INFO] Printing out all preprocessor defines in the C++ compiler ...
2025-05-07T20:25:39.5154124Z + conda run -n build_binary c++ -dM -E -x c++ -
2025-05-07T20:25:41.4083859Z #define __GNUC__ 11
2025-05-07T20:25:41.4100232Z #define __cplusplus 201703L
2025-05-07T20:25:41.4104677Z #define __GNUG__ 11
2025-05-07T20:25:41.4118975Z #define __VERSION__ "11.4.0"
2025-05-07T20:25:41.4132534Z #define __x86_64__ 1
2025-05-07T20:25:41.4155190Z #define __linux__ 1
2025-05-07T20:25:41.4165271Z #define __GNUC_PATCHLEVEL__ 0
2025-05-07T20:25:41.4180603Z #define __GNUC_MINOR__ 4
2025-05-07T20:25:41.4183896Z #define __STDC__ 1
[... several hundred additional predefined macros, including the __cpp_* feature-test macros, elided ...]
(conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:25:43.3503370Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:25:43.3503840Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:25:43.3504399Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:25:43.3504773Z 2025-05-07T20:25:43.3504777Z 2025-05-07T20:25:43.4118070Z 2025-05-07T20:25:43.4118426Z [INFO] Printing the default version of the C standard used by the compiler ... 2025-05-07T20:25:43.4119475Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__ 2025-05-07T20:25:43.4119810Z 2025-05-07T20:25:45.3644334Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:45.3647321Z 2025-05-07T20:25:45.3647684Z [INFO] Printing the default version of the C++ standard used by the compiler ... 2025-05-07T20:25:45.3648394Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus 2025-05-07T20:25:45.3648717Z 2025-05-07T20:25:47.3128849Z #define __cplusplus 201703L 2025-05-07T20:25:47.3130978Z 2025-05-07T20:25:47.3131578Z [INSTALL] Successfully installed C/C++ compilers 2025-05-07T20:25:47.3176995Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:47.3177462Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0 2025-05-07T20:25:47.3189724Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:47.3190082Z env: 2025-05-07T20:25:47.3190307Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:25:47.3190620Z BUILD_ENV: build_binary 2025-05-07T20:25:47.3190882Z BUILD_TARGET: genai 2025-05-07T20:25:47.3191116Z BUILD_VARIANT: cuda 2025-05-07T20:25:47.3191358Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:25:47.3191622Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:25:47.3191930Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:25:47.3192280Z ##[endgroup] 2025-05-07T20:25:47.6500865Z ################################################################################ 2025-05-07T20:25:47.6501381Z # Install CUDA 2025-05-07T20:25:47.6501679Z # 2025-05-07T20:25:47.6517678Z # [2025-05-07T20:25:47.651Z] + install_cuda build_binary 12.8.0 2025-05-07T20:25:47.6518237Z ################################################################################ 2025-05-07T20:25:47.6518559Z 2025-05-07T20:25:47.6534587Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:47.7421187Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:47.7421579Z [SETUP] Cleaning up Conda packages ... 2025-05-07T20:25:47.7426588Z + conda clean --packages --tarball -y 2025-05-07T20:25:47.7426798Z 2025-05-07T20:25:48.4479259Z Will remove 29 (113.6 MB) tarball(s). 2025-05-07T20:25:48.4479815Z Will remove 6 (619 KB) package(s). 2025-05-07T20:25:48.5133660Z 2025-05-07T20:25:48.5142184Z + conda clean --all -y 2025-05-07T20:25:48.5142397Z 2025-05-07T20:25:49.1841527Z There are no unused tarball(s) to remove. 2025-05-07T20:25:49.1842038Z Will remove 1 index cache(s). 2025-05-07T20:25:49.1842438Z There are no unused package(s) to remove. 2025-05-07T20:25:49.1842851Z There are no tempfile(s) to remove. 2025-05-07T20:25:49.1843261Z There are no logfile(s) to remove. 2025-05-07T20:25:49.2460535Z 2025-05-07T20:25:49.2473475Z [INSTALL] Installing CUDA 12.8.0 ... 
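For reference, the compiler probes above and the CUDA install below can be reproduced outside this workflow with plain conda commands. A minimal sketch, assuming miniconda is on PATH and an env named build_binary already exists (only the env name and the probe/install commands come from this job's log; the nvcc sanity check is an illustrative assumption, not part of the workflow script):

    # Probe the default C / C++ language standards (same commands the job runs):
    conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__     # expect 201710L (C17)
    conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus  # expect 201703L (C++17)
    # Install the CUDA 12.8 toolkit from conda-forge (mirrors the EXEC line below):
    conda install -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
    # Hypothetical sanity check (not in this log): confirm nvcc is visible in the env
    conda run -n build_binary nvcc --version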
2025-05-07T20:25:49.2497618Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0 2025-05-07T20:25:50.1538831Z Channels: 2025-05-07T20:25:50.1539131Z - conda-forge 2025-05-07T20:25:50.1539475Z Platform: linux-64 2025-05-07T20:26:00.7925002Z Collecting package metadata (repodata.json): - \ | / - \ | / - \ | / - \ | / - \ | done 2025-05-07T20:26:01.9179729Z Solving environment: - \ | / done 2025-05-07T20:26:01.9928136Z 2025-05-07T20:26:01.9928730Z ## Package Plan ## 2025-05-07T20:26:01.9928972Z 2025-05-07T20:26:01.9929272Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:26:01.9929705Z 2025-05-07T20:26:01.9929842Z added / updated specs: 2025-05-07T20:26:01.9930103Z - cuda=12.8.0 2025-05-07T20:26:01.9930240Z 2025-05-07T20:26:01.9930280Z 2025-05-07T20:26:01.9930404Z The following packages will be downloaded: 2025-05-07T20:26:01.9930626Z 2025-05-07T20:26:01.9930750Z package | build 2025-05-07T20:26:01.9931083Z ---------------------------|----------------- 2025-05-07T20:26:01.9931470Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T20:26:01.9931914Z attr-2.5.1 | h166bdaf_1 69 KB conda-forge 2025-05-07T20:26:01.9932405Z binutils-2.40 | h4852527_7 31 KB conda-forge 2025-05-07T20:26:01.9932840Z c-compiler-1.5.2 | h0b41bf4_0 6 KB conda-forge 2025-05-07T20:26:01.9933262Z cuda-12.8.0 | ha804496_0 26 KB conda-forge 2025-05-07T20:26:01.9933704Z cuda-cccl_linux-64-12.8.55 | ha770c72_1 1.0 MB conda-forge 2025-05-07T20:26:01.9934610Z cuda-command-line-tools-12.8.0| ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9935131Z cuda-compiler-12.8.0 | hbad6d8a_0 20 KB conda-forge 2025-05-07T20:26:01.9935679Z cuda-crt-dev_linux-64-12.8.61| ha770c72_1 90 KB conda-forge 2025-05-07T20:26:01.9936365Z cuda-crt-tools-12.8.61 | ha770c72_1 27 KB conda-forge 2025-05-07T20:26:01.9936830Z cuda-cudart-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9937301Z cuda-cudart-dev-12.8.57 | h5888daf_1 23 KB conda-forge 2025-05-07T20:26:01.9937809Z cuda-cudart-dev_linux-64-12.8.57| h3f2d84a_1 377 KB conda-forge 2025-05-07T20:26:01.9938320Z cuda-cudart-static-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9938849Z cuda-cudart-static_linux-64-12.8.57| h3f2d84a_1 950 KB conda-forge 2025-05-07T20:26:01.9939375Z cuda-cudart_linux-64-12.8.57| h3f2d84a_1 188 KB conda-forge 2025-05-07T20:26:01.9939873Z cuda-cuobjdump-12.8.55 | hbd13f7d_0 227 KB conda-forge 2025-05-07T20:26:01.9940333Z cuda-cupti-12.8.57 | hbd13f7d_0 1.8 MB conda-forge 2025-05-07T20:26:01.9940803Z cuda-cupti-dev-12.8.57 | h5888daf_0 4.0 MB conda-forge 2025-05-07T20:26:01.9941276Z cuda-cuxxfilt-12.8.55 | hbd13f7d_0 211 KB conda-forge 2025-05-07T20:26:01.9941747Z cuda-driver-dev-12.8.57 | h5888daf_1 22 KB conda-forge 2025-05-07T20:26:01.9942251Z cuda-driver-dev_linux-64-12.8.90| h3f2d84a_1 36 KB conda-forge 2025-05-07T20:26:01.9942732Z cuda-gdb-12.8.55 | h50b4baa_0 353 KB conda-forge 2025-05-07T20:26:01.9943187Z cuda-libraries-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9943669Z cuda-libraries-dev-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9944155Z cuda-nsight-12.8.55 | h7938cbb_0 113.2 MB conda-forge 2025-05-07T20:26:01.9944600Z cuda-nvcc-12.8.61 | hcdd1206_0 23 KB conda-forge 2025-05-07T20:26:01.9945075Z cuda-nvcc-dev_linux-64-12.8.61| he91c749_1 12.7 MB conda-forge 2025-05-07T20:26:01.9945564Z cuda-nvcc-impl-12.8.61 | h85509e4_1 25 KB conda-forge 2025-05-07T20:26:01.9946039Z cuda-nvcc-tools-12.8.61 | he02047a_1 24.5 
MB conda-forge 2025-05-07T20:26:01.9946521Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:01.9946994Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:01.9947464Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:01.9947925Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:01.9948390Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:01.9948849Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:01.9949308Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:01.9949764Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:01.9950231Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:01.9950723Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:01.9951197Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:01.9951660Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:01.9952107Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:01.9952578Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:01.9953178Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:01.9953657Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:01.9954139Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:01.9954699Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:01.9955145Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:01.9955584Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:01.9956055Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:01.9956532Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:01.9956956Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:01.9957350Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:01.9957834Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:01.9958367Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:01.9958895Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:01.9959412Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:01.9959871Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:01.9960457Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:01.9960938Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:01.9961390Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:01.9961831Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:01.9962264Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:01.9962686Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:01.9963072Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:01.9963486Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:01.9963895Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:01.9964298Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:01.9964730Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:01.9965194Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:01.9965647Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:01.9966103Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:01.9966566Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:26:01.9967024Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:26:01.9967487Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:01.9967956Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:01.9968429Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:01.9968904Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:01.9969386Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:01.9969868Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:01.9970358Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:01.9970835Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:01.9971376Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:01.9971845Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:01.9972393Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:01.9972844Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:01.9973268Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:01.9973714Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:01.9974156Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:01.9974568Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:01.9974987Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:01.9975436Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:01.9975878Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:01.9976320Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:01.9976804Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:01.9977285Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:01.9977770Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:01.9978240Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:01.9978707Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:01.9979166Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:01.9979589Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:01.9980022Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:01.9980469Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:01.9980921Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:01.9981350Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:01.9981817Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:01.9982261Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:01.9982711Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:01.9983144Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:01.9983569Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:01.9983985Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:01.9984386Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:01.9984854Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:01.9985314Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:01.9985705Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:01.9986113Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:01.9986575Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:01.9987030Z pcre2-10.44 | 
hc749103_2 934 KB conda-forge 2025-05-07T20:26:01.9987472Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:01.9987930Z python-3.13.0 |h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:01.9988468Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:01.9988893Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:01.9989404Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:01.9989819Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:01.9990241Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:01.9990688Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:01.9991164Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:01.9991637Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:01.9992138Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:01.9992607Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:01.9993080Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:01.9993552Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:01.9993996Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:01.9994443Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:01.9994895Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:01.9995377Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:01.9995874Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:01.9996350Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:01.9996812Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:01.9997276Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:01.9997733Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:01.9998189Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:01.9998673Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:01.9999138Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:01.9999567Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:01.9999962Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:02.0000460Z ------------------------------------------------------------ 2025-05-07T20:26:02.0000806Z Total: 1.91 GB 2025-05-07T20:26:02.0001028Z 2025-05-07T20:26:02.0001160Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:02.0001390Z 2025-05-07T20:26:02.0001618Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:02.0002049Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:02.0002490Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:02.0002966Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:02.0003410Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:02.0003888Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:02.0004498Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0005093Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:02.0005653Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:02.0006222Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:02.0006846Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:26:02.0007386Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:02.0008346Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0009188Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:02.0010145Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0011112Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:02.0012045Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0012873Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0013973Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:26:02.0014840Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0015670Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:26:02.0016585Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:26:02.0017424Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:26:02.0018214Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:26:02.0019095Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:26:02.0019975Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:26:02.0020729Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:26:02.0021536Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:26:02.0022353Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:26:02.0022948Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:26:02.0023534Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:26:02.0024115Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0024657Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0025195Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0025725Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0026251Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:26:02.0026776Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:26:02.0027299Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0027851Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:02.0028440Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:26:02.0029009Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:26:02.0029546Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:26:02.0030051Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0030604Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:26:02.0031196Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:26:02.0031766Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:26:02.0032339Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0032916Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:26:02.0033660Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0034169Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:26:02.0034848Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:26:02.0035410Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:02.0035886Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:02.0036422Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:02.0037055Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:02.0037676Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:02.0038276Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:02.0038800Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:02.0039330Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:02.0039847Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:02.0040473Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:02.0040925Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:02.0041373Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:26:02.0041819Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:02.0042222Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:02.0042657Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:02.0043099Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:02.0043520Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:02.0044002Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:02.0044540Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:02.0045064Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:26:02.0045590Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:26:02.0046114Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:26:02.0046645Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:26:02.0047172Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:26:02.0047703Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:26:02.0048251Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:02.0048818Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:02.0049388Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:26:02.0049958Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:26:02.0051039Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:02.0051581Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:02.0052139Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:02.0052668Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:02.0053210Z 
libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:02.0053717Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:02.0054177Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:26:02.0054676Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:02.0055275Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:02.0055729Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:02.0056262Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:26:02.0056753Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:26:02.0057243Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:02.0057925Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:26:02.0058480Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:26:02.0059180Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:26:02.0059747Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:26:02.0060297Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:26:02.0060826Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:26:02.0061346Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:26:02.0061850Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:02.0062326Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:02.0062816Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:02.0063304Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:02.0063753Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:02.0064235Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:02.0064745Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:02.0065216Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:02.0065664Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:02.0066097Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:02.0066608Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:26:02.0067120Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:02.0067513Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:02.0067932Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:02.0068449Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:02.0068968Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:02.0069448Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:02.0069959Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:02.0070425Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:02.0071122Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:02.0071761Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:02.0072572Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:02.0073236Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:02.0073840Z xcb-util-renderut~ 
conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0 2025-05-07T20:26:02.0074392Z xcb-util-wm conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0 2025-05-07T20:26:02.0074921Z xkeyboard-config conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0 2025-05-07T20:26:02.0084632Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T20:26:02.0085227Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T20:26:02.0085914Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T20:26:02.0086421Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T20:26:02.0087074Z xorg-libxcomposite conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2 2025-05-07T20:26:02.0087943Z xorg-libxdamage conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0 2025-05-07T20:26:02.0088647Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T20:26:02.0089181Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T20:26:02.0089724Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T20:26:02.0090251Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T20:26:02.0090815Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T20:26:02.0091387Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T20:26:02.0091997Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T20:26:02.0092680Z zstd conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2 2025-05-07T20:26:02.0093043Z 2025-05-07T20:26:02.0093183Z The following packages will be UPDATED: 2025-05-07T20:26:02.0093401Z 2025-05-07T20:26:02.0093699Z libuuid pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0 2025-05-07T20:26:02.0094345Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T20:26:02.0094966Z sqlite pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0 2025-05-07T20:26:02.0095566Z zlib pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:02.0095911Z 2025-05-07T20:26:02.0096136Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T20:26:02.0096462Z 2025-05-07T20:26:02.0096718Z expat pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0 2025-05-07T20:26:02.0097360Z python pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313 2025-05-07T20:26:02.0098000Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T20:26:02.0098337Z 2025-05-07T20:26:02.0098363Z 2025-05-07T20:26:02.0098367Z 2025-05-07T20:26:02.0098517Z Downloading and Extracting Packages: ...working... 2025-05-07T20:26:02.0099005Z libcublas-12.8.3.14 | 460.2 MB | | 0% 2025-05-07T20:26:02.0099348Z 2025-05-07T20:26:02.0099778Z nsight-compute-2025. 
| 320.6 MB | | 0%
2025-05-07T20:26:02.1025561Z [... carriage-return progress-bar updates elided; libcublas (460.2 MB), nsight-compute (320.6 MB), libcusparse (164.9 MB), libcusolver (156.9 MB), libcufft (147.4 MB), libnpp (130.6 MB), cuda-nsight (113.2 MB), cuda-nvvp (112.4 MB) and the remaining packages download in parallel ...]
| 320.6 MB | ###9 | 39%  2025-05-07T20:26:05.6683411Z 2025-05-07T20:26:05.6683416Z 2025-05-07T20:26:05.6683422Z 2025-05-07T20:26:05.6687035Z 2025-05-07T20:26:05.6803955Z libcufft-11.3.3.41 | 147.4 MB | #######9 | 79%  2025-05-07T20:26:05.6944140Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:26:05.6944396Z 2025-05-07T20:26:05.6944634Z 2025-05-07T20:26:05.7409016Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:26:05.7409480Z 2025-05-07T20:26:05.7409486Z 2025-05-07T20:26:05.7409787Z 2025-05-07T20:26:05.7660479Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 77%  2025-05-07T20:26:05.7662704Z 2025-05-07T20:26:05.7686751Z nsight-compute-2025. | 320.6 MB | #### | 40%  2025-05-07T20:26:05.7687102Z 2025-05-07T20:26:05.7687109Z 2025-05-07T20:26:05.7687114Z 2025-05-07T20:26:05.7688499Z 2025-05-07T20:26:05.7805967Z libcufft-11.3.3.41 | 147.4 MB | ########1 | 82%  2025-05-07T20:26:05.7944949Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 29% 2025-05-07T20:26:05.7945292Z 2025-05-07T20:26:05.7947862Z 2025-05-07T20:26:05.8411336Z libcusparse-12.5.7.5 | 164.9 MB | #######8 | 79%  2025-05-07T20:26:05.8411728Z 2025-05-07T20:26:05.8411733Z 2025-05-07T20:26:05.8412272Z 2025-05-07T20:26:05.8664842Z libcusolver-11.7.2.5 | 156.9 MB | ######## | 80%  2025-05-07T20:26:05.8668635Z 2025-05-07T20:26:05.8692530Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:26:05.8692866Z 2025-05-07T20:26:05.8692872Z 2025-05-07T20:26:05.8692876Z 2025-05-07T20:26:05.8694039Z 2025-05-07T20:26:05.8835594Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 84%  2025-05-07T20:26:05.8947565Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 30% 2025-05-07T20:26:05.8947926Z 2025-05-07T20:26:05.8947931Z 2025-05-07T20:26:05.9416567Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:26:05.9416932Z 2025-05-07T20:26:05.9416936Z 2025-05-07T20:26:05.9416954Z 2025-05-07T20:26:05.9695257Z libcusolver-11.7.2.5 | 156.9 MB | ########2 | 82%  2025-05-07T20:26:05.9696727Z 2025-05-07T20:26:05.9734706Z nsight-compute-2025. | 320.6 MB | ####2 | 43%  2025-05-07T20:26:05.9734993Z 2025-05-07T20:26:05.9734997Z 2025-05-07T20:26:05.9735000Z 2025-05-07T20:26:05.9735004Z 2025-05-07T20:26:05.9838398Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:26:05.9950294Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:26:05.9950734Z 2025-05-07T20:26:05.9952276Z 2025-05-07T20:26:06.0420639Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 84%  2025-05-07T20:26:06.0421019Z 2025-05-07T20:26:06.0421023Z 2025-05-07T20:26:06.0423193Z 2025-05-07T20:26:06.0725784Z libcusolver-11.7.2.5 | 156.9 MB | ########4 | 85%  2025-05-07T20:26:06.0727499Z 2025-05-07T20:26:06.0770380Z nsight-compute-2025. | 320.6 MB | ####3 | 44%  2025-05-07T20:26:06.0770753Z 2025-05-07T20:26:06.0770757Z 2025-05-07T20:26:06.0770761Z 2025-05-07T20:26:06.0771402Z 2025-05-07T20:26:06.0843978Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 88%  2025-05-07T20:26:06.0954209Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:26:06.0954479Z 2025-05-07T20:26:06.0954483Z 2025-05-07T20:26:06.1429532Z libcusparse-12.5.7.5 | 164.9 MB | ########5 | 86%  2025-05-07T20:26:06.1429941Z 2025-05-07T20:26:06.1429947Z 2025-05-07T20:26:06.1434537Z 2025-05-07T20:26:06.1747642Z libcusolver-11.7.2.5 | 156.9 MB | ########7 | 87%  2025-05-07T20:26:06.1751112Z 2025-05-07T20:26:06.1773403Z nsight-compute-2025. 
| 320.6 MB | ####4 | 45%  2025-05-07T20:26:06.1773759Z 2025-05-07T20:26:06.1773763Z 2025-05-07T20:26:06.1773767Z 2025-05-07T20:26:06.1774476Z 2025-05-07T20:26:06.1895505Z libcufft-11.3.3.41 | 147.4 MB | ######### | 90%  2025-05-07T20:26:06.1954514Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:26:06.1954900Z 2025-05-07T20:26:06.1956830Z 2025-05-07T20:26:06.2471065Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 88%  2025-05-07T20:26:06.2471357Z 2025-05-07T20:26:06.2471361Z 2025-05-07T20:26:06.2473861Z 2025-05-07T20:26:06.2754396Z libcusolver-11.7.2.5 | 156.9 MB | ########9 | 89%  2025-05-07T20:26:06.2754735Z 2025-05-07T20:26:06.2779918Z nsight-compute-2025. | 320.6 MB | ####5 | 46%  2025-05-07T20:26:06.2780552Z 2025-05-07T20:26:06.2780556Z 2025-05-07T20:26:06.2780559Z 2025-05-07T20:26:06.2784958Z 2025-05-07T20:26:06.2902510Z libcufft-11.3.3.41 | 147.4 MB | #########2 | 93%  2025-05-07T20:26:06.3000567Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:26:06.3000843Z 2025-05-07T20:26:06.3002043Z 2025-05-07T20:26:06.3473311Z libcusparse-12.5.7.5 | 164.9 MB | ######### | 90%  2025-05-07T20:26:06.3473657Z 2025-05-07T20:26:06.3473663Z 2025-05-07T20:26:06.3475138Z 2025-05-07T20:26:06.3754909Z libcusolver-11.7.2.5 | 156.9 MB | #########1 | 92%  2025-05-07T20:26:06.3756696Z 2025-05-07T20:26:06.3844356Z nsight-compute-2025. | 320.6 MB | ####6 | 47%  2025-05-07T20:26:06.3844740Z 2025-05-07T20:26:06.3844747Z 2025-05-07T20:26:06.3844771Z 2025-05-07T20:26:06.3844774Z 2025-05-07T20:26:06.3925867Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 95%  2025-05-07T20:26:06.4002378Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:26:06.4002697Z 2025-05-07T20:26:06.4005054Z 2025-05-07T20:26:06.4477476Z libcusparse-12.5.7.5 | 164.9 MB | #########2 | 93%  2025-05-07T20:26:06.4477972Z 2025-05-07T20:26:06.4477978Z 2025-05-07T20:26:06.4479354Z 2025-05-07T20:26:06.4763150Z libcusolver-11.7.2.5 | 156.9 MB | #########4 | 94%  2025-05-07T20:26:06.4763506Z 2025-05-07T20:26:06.4846548Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:26:06.4846894Z 2025-05-07T20:26:06.4846898Z 2025-05-07T20:26:06.4846902Z 2025-05-07T20:26:06.4846905Z 2025-05-07T20:26:06.4965785Z libcufft-11.3.3.41 | 147.4 MB | #########7 | 97%  2025-05-07T20:26:06.5005397Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:26:06.5005657Z 2025-05-07T20:26:06.5008939Z 2025-05-07T20:26:06.5515446Z libcusparse-12.5.7.5 | 164.9 MB | #########4 | 95%  2025-05-07T20:26:06.5515746Z 2025-05-07T20:26:06.5515749Z 2025-05-07T20:26:06.5518314Z 2025-05-07T20:26:06.5764928Z libcusolver-11.7.2.5 | 156.9 MB | #########6 | 97%  2025-05-07T20:26:06.5766037Z 2025-05-07T20:26:06.5847017Z nsight-compute-2025. | 320.6 MB | ####9 | 49%  2025-05-07T20:26:06.5847302Z 2025-05-07T20:26:06.5847306Z 2025-05-07T20:26:06.5847311Z 2025-05-07T20:26:06.5847482Z 2025-05-07T20:26:06.5976077Z libcufft-11.3.3.41 | 147.4 MB | #########9 | 99%  2025-05-07T20:26:06.6059300Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 35% 2025-05-07T20:26:06.6059660Z 2025-05-07T20:26:06.6060869Z 2025-05-07T20:26:06.6515532Z libcusparse-12.5.7.5 | 164.9 MB | #########7 | 97%  2025-05-07T20:26:06.6515868Z 2025-05-07T20:26:06.6515872Z 2025-05-07T20:26:06.6518600Z 2025-05-07T20:26:06.6766813Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:26:06.6767615Z 2025-05-07T20:26:06.6980907Z nsight-compute-2025. 
| 320.6 MB | ##### | 50%  2025-05-07T20:26:06.7062864Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 36% 2025-05-07T20:26:06.7063147Z 2025-05-07T20:26:06.7063780Z 2025-05-07T20:26:06.7767135Z libcusparse-12.5.7.5 | 164.9 MB | #########9 | 100%  2025-05-07T20:26:06.7768133Z 2025-05-07T20:26:06.7981836Z nsight-compute-2025. | 320.6 MB | #####1 | 52%  2025-05-07T20:26:06.8768854Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:26:06.8771717Z 2025-05-07T20:26:06.8984483Z nsight-compute-2025. | 320.6 MB | #####3 | 53%  2025-05-07T20:26:06.9770004Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 38% 2025-05-07T20:26:06.9771802Z 2025-05-07T20:26:06.9986517Z nsight-compute-2025. | 320.6 MB | #####4 | 55%  2025-05-07T20:26:07.0770071Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:26:07.0770579Z 2025-05-07T20:26:07.0987180Z nsight-compute-2025. | 320.6 MB | #####6 | 57%  2025-05-07T20:26:07.1773513Z libcublas-12.8.3.14 | 460.2 MB | #### | 40% 2025-05-07T20:26:07.1773879Z 2025-05-07T20:26:07.1991196Z nsight-compute-2025. | 320.6 MB | #####8 | 58%  2025-05-07T20:26:07.2774507Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 41% 2025-05-07T20:26:07.2777457Z 2025-05-07T20:26:07.2991507Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:26:07.3776156Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:26:07.3776573Z 2025-05-07T20:26:07.4137045Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:26:07.4776604Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:26:07.4776867Z 2025-05-07T20:26:07.5542509Z nsight-compute-2025. | 320.6 MB | ######3 | 63%  2025-05-07T20:26:07.5778326Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:26:07.5779287Z 2025-05-07T20:26:07.6543040Z nsight-compute-2025. | 320.6 MB | ######5 | 65%  2025-05-07T20:26:07.6960904Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:07.6961285Z 2025-05-07T20:26:07.7545608Z nsight-compute-2025. | 320.6 MB | ######7 | 67%  2025-05-07T20:26:07.8157429Z libcublas-12.8.3.14 | 460.2 MB | ####6 | 47% 2025-05-07T20:26:07.8159556Z 2025-05-07T20:26:07.8546847Z nsight-compute-2025. | 320.6 MB | ######8 | 69%  2025-05-07T20:26:07.9175165Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:07.9175863Z 2025-05-07T20:26:07.9550156Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:26:08.0268751Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:08.0269519Z 2025-05-07T20:26:08.0551347Z nsight-compute-2025. | 320.6 MB | #######2 | 72%  2025-05-07T20:26:08.1426335Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:26:08.1426674Z 2025-05-07T20:26:08.1551819Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:26:08.2426964Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 52% 2025-05-07T20:26:08.2427301Z 2025-05-07T20:26:08.2556231Z nsight-compute-2025. | 320.6 MB | #######5 | 75%  2025-05-07T20:26:08.3431453Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 53% 2025-05-07T20:26:08.3431830Z 2025-05-07T20:26:08.3562912Z nsight-compute-2025. | 320.6 MB | #######6 | 77%  2025-05-07T20:26:08.4488248Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 54% 2025-05-07T20:26:08.4488966Z 2025-05-07T20:26:08.4598715Z nsight-compute-2025. | 320.6 MB | #######8 | 78%  2025-05-07T20:26:08.5488838Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 55% 2025-05-07T20:26:08.5489222Z 2025-05-07T20:26:08.5598696Z nsight-compute-2025. 
| 320.6 MB | #######9 | 80%  2025-05-07T20:26:08.6521324Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 56% 2025-05-07T20:26:08.6524457Z 2025-05-07T20:26:08.6607269Z nsight-compute-2025. | 320.6 MB | ########1 | 81%  2025-05-07T20:26:08.7522744Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:26:08.7523280Z 2025-05-07T20:26:08.7616975Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:08.8574129Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:26:08.8575004Z 2025-05-07T20:26:08.8628650Z nsight-compute-2025. | 320.6 MB | ########4 | 84%  2025-05-07T20:26:08.9629390Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 59% 2025-05-07T20:26:08.9630085Z 2025-05-07T20:26:08.9634078Z nsight-compute-2025. | 320.6 MB | ########5 | 86%  2025-05-07T20:26:09.0629713Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:09.0630079Z 2025-05-07T20:26:09.0659450Z nsight-compute-2025. | 320.6 MB | ########7 | 87%  2025-05-07T20:26:09.1632722Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:09.1633388Z 2025-05-07T20:26:09.1661483Z nsight-compute-2025. | 320.6 MB | ########8 | 89%  2025-05-07T20:26:09.2665833Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:09.2797909Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:09.2798848Z 2025-05-07T20:26:09.3665798Z nsight-compute-2025. | 320.6 MB | ######### | 90%  2025-05-07T20:26:09.3988779Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:26:09.3989204Z 2025-05-07T20:26:09.4725488Z nsight-compute-2025. | 320.6 MB | #########1 | 92%  2025-05-07T20:26:09.4989770Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:09.4992140Z 2025-05-07T20:26:09.5238784Z nsight-compute-2025. | 320.6 MB | #########3 | 93%  2025-05-07T20:26:09.5239127Z 2025-05-07T20:26:09.5239133Z 2025-05-07T20:26:09.5239138Z 2025-05-07T20:26:09.5239143Z 2025-05-07T20:26:09.5867655Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:09.5867952Z 2025-05-07T20:26:09.5867956Z 2025-05-07T20:26:09.5867959Z 2025-05-07T20:26:09.5867963Z 2025-05-07T20:26:09.5871571Z 2025-05-07T20:26:09.6016039Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:09.6188925Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:09.6190584Z 2025-05-07T20:26:09.6868764Z nsight-compute-2025. | 320.6 MB | #########4 | 94%  2025-05-07T20:26:09.6869050Z 2025-05-07T20:26:09.6869055Z 2025-05-07T20:26:09.6869058Z 2025-05-07T20:26:09.6869062Z 2025-05-07T20:26:09.6871292Z 2025-05-07T20:26:09.7404565Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:26:09.7406297Z 2025-05-07T20:26:09.7433377Z nsight-compute-2025. | 320.6 MB | #########5 | 96%  2025-05-07T20:26:09.7870704Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:09.7871030Z 2025-05-07T20:26:09.7871034Z 2025-05-07T20:26:09.7871038Z 2025-05-07T20:26:09.7871042Z 2025-05-07T20:26:09.7872502Z 2025-05-07T20:26:09.8411648Z libnpp-12.3.3.65 | 130.6 MB | 5 | 6%  2025-05-07T20:26:09.8411940Z 2025-05-07T20:26:09.8411944Z 2025-05-07T20:26:09.8421144Z 2025-05-07T20:26:09.8571849Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:09.8572146Z 2025-05-07T20:26:09.8573518Z 2025-05-07T20:26:09.8776935Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:09.8779267Z 2025-05-07T20:26:09.8794048Z nsight-compute-2025. 
| 320.6 MB | #########7 | 97%  2025-05-07T20:26:09.8875612Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:26:09.8876025Z 2025-05-07T20:26:09.8876032Z 2025-05-07T20:26:09.8876037Z 2025-05-07T20:26:09.8876042Z 2025-05-07T20:26:09.8878164Z 2025-05-07T20:26:09.8936290Z libnpp-12.3.3.65 | 130.6 MB | 8 | 8%  2025-05-07T20:26:09.8936593Z 2025-05-07T20:26:09.8936597Z 2025-05-07T20:26:09.8936600Z 2025-05-07T20:26:09.8936604Z 2025-05-07T20:26:09.8936608Z 2025-05-07T20:26:09.8937896Z 2025-05-07T20:26:09.9041861Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:09.9042165Z 2025-05-07T20:26:09.9042181Z 2025-05-07T20:26:09.9042185Z 2025-05-07T20:26:09.9042188Z 2025-05-07T20:26:09.9042192Z 2025-05-07T20:26:09.9042196Z 2025-05-07T20:26:09.9045212Z 2025-05-07T20:26:09.9944993Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:09.9945304Z 2025-05-07T20:26:09.9945308Z 2025-05-07T20:26:09.9945312Z 2025-05-07T20:26:09.9945323Z 2025-05-07T20:26:09.9945327Z 2025-05-07T20:26:09.9955790Z 2025-05-07T20:26:10.0050320Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 2%  2025-05-07T20:26:10.0050636Z 2025-05-07T20:26:10.0050649Z 2025-05-07T20:26:10.0050653Z 2025-05-07T20:26:10.0050657Z 2025-05-07T20:26:10.0050660Z 2025-05-07T20:26:10.0050664Z 2025-05-07T20:26:10.0052257Z 2025-05-07T20:26:10.0169599Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:26:10.0169908Z 2025-05-07T20:26:10.0169913Z 2025-05-07T20:26:10.0169916Z 2025-05-07T20:26:10.0169920Z 2025-05-07T20:26:10.0169925Z 2025-05-07T20:26:10.0238980Z libnpp-12.3.3.65 | 130.6 MB | # | 11%  2025-05-07T20:26:10.0239334Z 2025-05-07T20:26:10.0434901Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:26:10.0951009Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:10.0951359Z 2025-05-07T20:26:10.0951363Z 2025-05-07T20:26:10.0951367Z 2025-05-07T20:26:10.0951371Z 2025-05-07T20:26:10.0951375Z 2025-05-07T20:26:10.0952616Z 2025-05-07T20:26:10.1058872Z cuda-nsight-12.8.55 | 113.2 MB | 4 | 4%  2025-05-07T20:26:10.1059210Z 2025-05-07T20:26:10.1059214Z 2025-05-07T20:26:10.1059218Z 2025-05-07T20:26:10.1059221Z 2025-05-07T20:26:10.1059225Z 2025-05-07T20:26:10.1059228Z 2025-05-07T20:26:10.1059232Z 2025-05-07T20:26:10.1334151Z cuda-nvvp-12.8.57 | 112.4 MB | 3 | 4%  2025-05-07T20:26:10.1334526Z 2025-05-07T20:26:10.1334529Z 2025-05-07T20:26:10.1334533Z 2025-05-07T20:26:10.1334537Z 2025-05-07T20:26:10.1334564Z 2025-05-07T20:26:10.1659782Z libnpp-12.3.3.65 | 130.6 MB | #2 | 13%  2025-05-07T20:26:10.1663675Z 2025-05-07T20:26:10.1892876Z nsight-compute-2025. 
| 320.6 MB | #########9 | 99%  2025-05-07T20:26:10.1955920Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 71% 2025-05-07T20:26:10.1956281Z 2025-05-07T20:26:10.1956287Z 2025-05-07T20:26:10.1956293Z 2025-05-07T20:26:10.1956298Z 2025-05-07T20:26:10.1956303Z 2025-05-07T20:26:10.1958678Z 2025-05-07T20:26:10.2059157Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 6%  2025-05-07T20:26:10.2059632Z 2025-05-07T20:26:10.2059636Z 2025-05-07T20:26:10.2059640Z 2025-05-07T20:26:10.2059644Z 2025-05-07T20:26:10.2059647Z 2025-05-07T20:26:10.2059651Z 2025-05-07T20:26:10.2063347Z 2025-05-07T20:26:10.2589243Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 6%  2025-05-07T20:26:10.2589661Z 2025-05-07T20:26:10.2589668Z 2025-05-07T20:26:10.2589673Z 2025-05-07T20:26:10.2589679Z 2025-05-07T20:26:10.2589704Z 2025-05-07T20:26:10.2965664Z libnpp-12.3.3.65 | 130.6 MB | #4 | 15%  2025-05-07T20:26:10.2966053Z 2025-05-07T20:26:10.2966059Z 2025-05-07T20:26:10.2966081Z 2025-05-07T20:26:10.2966086Z 2025-05-07T20:26:10.2966091Z 2025-05-07T20:26:10.2968072Z 2025-05-07T20:26:10.3065638Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:26:10.3066063Z 2025-05-07T20:26:10.3066068Z 2025-05-07T20:26:10.3066074Z 2025-05-07T20:26:10.3066079Z 2025-05-07T20:26:10.3066084Z 2025-05-07T20:26:10.3066089Z 2025-05-07T20:26:10.3066094Z 2025-05-07T20:26:10.3308772Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 8%  2025-05-07T20:26:10.3632475Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:10.3632847Z 2025-05-07T20:26:10.3632853Z 2025-05-07T20:26:10.3632858Z 2025-05-07T20:26:10.3632863Z 2025-05-07T20:26:10.3634392Z 2025-05-07T20:26:10.3966130Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:10.3966430Z 2025-05-07T20:26:10.3966434Z 2025-05-07T20:26:10.3966437Z 2025-05-07T20:26:10.3966441Z 2025-05-07T20:26:10.3966445Z 2025-05-07T20:26:10.3968434Z 2025-05-07T20:26:10.4065961Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:10.4066278Z 2025-05-07T20:26:10.4066282Z 2025-05-07T20:26:10.4066285Z 2025-05-07T20:26:10.4066289Z 2025-05-07T20:26:10.4066293Z 2025-05-07T20:26:10.4066296Z 2025-05-07T20:26:10.4068152Z 2025-05-07T20:26:10.4402218Z cuda-nvvp-12.8.57 | 112.4 MB | # | 11%  2025-05-07T20:26:10.4718357Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:26:10.4718630Z 2025-05-07T20:26:10.4718634Z 2025-05-07T20:26:10.4718638Z 2025-05-07T20:26:10.4718641Z 2025-05-07T20:26:10.4720973Z 2025-05-07T20:26:10.4967892Z libnpp-12.3.3.65 | 130.6 MB | #8 | 19%  2025-05-07T20:26:10.4968189Z 2025-05-07T20:26:10.4968193Z 2025-05-07T20:26:10.4968478Z 2025-05-07T20:26:10.4968483Z 2025-05-07T20:26:10.4968487Z 2025-05-07T20:26:10.4973731Z 2025-05-07T20:26:10.5067975Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:10.5068505Z 2025-05-07T20:26:10.5068509Z 2025-05-07T20:26:10.5068512Z 2025-05-07T20:26:10.5068516Z 2025-05-07T20:26:10.5068520Z 2025-05-07T20:26:10.5068523Z 2025-05-07T20:26:10.5069872Z 2025-05-07T20:26:10.5450768Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 13%  2025-05-07T20:26:10.5723372Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 73% 2025-05-07T20:26:10.5723697Z 2025-05-07T20:26:10.5723701Z 2025-05-07T20:26:10.5723705Z 2025-05-07T20:26:10.5723708Z 2025-05-07T20:26:10.5725897Z 2025-05-07T20:26:10.5977933Z libnpp-12.3.3.65 | 130.6 MB | ## | 21%  2025-05-07T20:26:10.5978229Z 2025-05-07T20:26:10.5978233Z 2025-05-07T20:26:10.5978237Z 2025-05-07T20:26:10.5978241Z 2025-05-07T20:26:10.5978244Z 2025-05-07T20:26:10.5980402Z 2025-05-07T20:26:10.6072237Z cuda-nsight-12.8.55 | 113.2 MB | #4 | 15%  2025-05-07T20:26:10.6072548Z 
2025-05-07T20:26:10.6072552Z 2025-05-07T20:26:10.6072566Z 2025-05-07T20:26:10.6072570Z 2025-05-07T20:26:10.6072573Z 2025-05-07T20:26:10.6072577Z 2025-05-07T20:26:10.6073984Z 2025-05-07T20:26:10.6515209Z cuda-nvvp-12.8.57 | 112.4 MB | #5 | 15%  2025-05-07T20:26:10.6727480Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:10.6727772Z 2025-05-07T20:26:10.6727776Z 2025-05-07T20:26:10.6727780Z 2025-05-07T20:26:10.6727784Z 2025-05-07T20:26:10.6731926Z 2025-05-07T20:26:10.7046718Z libnpp-12.3.3.65 | 130.6 MB | ##2 | 23%  2025-05-07T20:26:10.7047007Z 2025-05-07T20:26:10.7047011Z 2025-05-07T20:26:10.7047015Z 2025-05-07T20:26:10.7047019Z 2025-05-07T20:26:10.7047022Z 2025-05-07T20:26:10.7052373Z 2025-05-07T20:26:10.7128304Z cuda-nsight-12.8.55 | 113.2 MB | #6 | 17%  2025-05-07T20:26:10.7128653Z 2025-05-07T20:26:10.7128659Z 2025-05-07T20:26:10.7128664Z 2025-05-07T20:26:10.7128669Z 2025-05-07T20:26:10.7128674Z 2025-05-07T20:26:10.7128680Z 2025-05-07T20:26:10.7128787Z 2025-05-07T20:26:10.7582100Z cuda-nvvp-12.8.57 | 112.4 MB | #7 | 17%  2025-05-07T20:26:10.7822745Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:10.7823013Z 2025-05-07T20:26:10.7823018Z 2025-05-07T20:26:10.7823021Z 2025-05-07T20:26:10.7823025Z 2025-05-07T20:26:10.7823084Z 2025-05-07T20:26:10.8140836Z libnpp-12.3.3.65 | 130.6 MB | ##4 | 25%  2025-05-07T20:26:10.8141231Z 2025-05-07T20:26:10.8141235Z 2025-05-07T20:26:10.8141239Z 2025-05-07T20:26:10.8141242Z 2025-05-07T20:26:10.8141246Z 2025-05-07T20:26:10.8141249Z 2025-05-07T20:26:10.8141253Z 2025-05-07T20:26:10.8201882Z cuda-nvvp-12.8.57 | 112.4 MB | #9 | 20%  2025-05-07T20:26:10.8202180Z 2025-05-07T20:26:10.8202205Z 2025-05-07T20:26:10.8202209Z 2025-05-07T20:26:10.8202213Z 2025-05-07T20:26:10.8202217Z 2025-05-07T20:26:10.8204709Z 2025-05-07T20:26:10.8645385Z cuda-nsight-12.8.55 | 113.2 MB | #9 | 19%  2025-05-07T20:26:10.8930716Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 75% 2025-05-07T20:26:10.8930997Z 2025-05-07T20:26:10.8931001Z 2025-05-07T20:26:10.8931005Z 2025-05-07T20:26:10.8931008Z 2025-05-07T20:26:10.8943245Z 2025-05-07T20:26:10.9154880Z libnpp-12.3.3.65 | 130.6 MB | ##6 | 27%  2025-05-07T20:26:10.9155224Z 2025-05-07T20:26:10.9155228Z 2025-05-07T20:26:10.9155232Z 2025-05-07T20:26:10.9155236Z 2025-05-07T20:26:10.9155239Z 2025-05-07T20:26:10.9155243Z 2025-05-07T20:26:10.9155247Z 2025-05-07T20:26:10.9231088Z cuda-nvvp-12.8.57 | 112.4 MB | ##1 | 22%  2025-05-07T20:26:10.9231384Z 2025-05-07T20:26:10.9231388Z 2025-05-07T20:26:10.9231392Z 2025-05-07T20:26:10.9231395Z 2025-05-07T20:26:10.9231399Z 2025-05-07T20:26:10.9231633Z 2025-05-07T20:26:10.9935190Z cuda-nsight-12.8.55 | 113.2 MB | ##1 | 21%  2025-05-07T20:26:10.9935492Z 2025-05-07T20:26:10.9935496Z 2025-05-07T20:26:10.9935731Z 2025-05-07T20:26:10.9935735Z 2025-05-07T20:26:10.9940362Z 2025-05-07T20:26:11.0158755Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 29%  2025-05-07T20:26:11.0159039Z 2025-05-07T20:26:11.0159043Z 2025-05-07T20:26:11.0159047Z 2025-05-07T20:26:11.0159051Z 2025-05-07T20:26:11.0159055Z 2025-05-07T20:26:11.0159058Z 2025-05-07T20:26:11.0161104Z 2025-05-07T20:26:11.0235556Z cuda-nvvp-12.8.57 | 112.4 MB | ##4 | 24%  2025-05-07T20:26:11.0238683Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:11.0238974Z 2025-05-07T20:26:11.0238979Z 2025-05-07T20:26:11.0238982Z 2025-05-07T20:26:11.0238986Z 2025-05-07T20:26:11.0238990Z 2025-05-07T20:26:11.0238995Z 2025-05-07T20:26:11.0944610Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 24%  2025-05-07T20:26:11.0944916Z 2025-05-07T20:26:11.0944920Z 
2025-05-07T20:26:11.0944924Z 2025-05-07T20:26:11.0944927Z 2025-05-07T20:26:11.0952158Z 2025-05-07T20:26:11.1239390Z libnpp-12.3.3.65 | 130.6 MB | ### | 31%  2025-05-07T20:26:11.1258162Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:11.1258508Z 2025-05-07T20:26:11.1258513Z 2025-05-07T20:26:11.1258516Z 2025-05-07T20:26:11.1258520Z 2025-05-07T20:26:11.1258523Z 2025-05-07T20:26:11.1260751Z 2025-05-07T20:26:11.1268688Z cuda-nsight-12.8.55 | 113.2 MB | ##5 | 26%  2025-05-07T20:26:11.1269001Z 2025-05-07T20:26:11.1269005Z 2025-05-07T20:26:11.1269008Z 2025-05-07T20:26:11.1269012Z 2025-05-07T20:26:11.1269016Z 2025-05-07T20:26:11.1269019Z 2025-05-07T20:26:11.1269023Z 2025-05-07T20:26:11.2123344Z cuda-nvvp-12.8.57 | 112.4 MB | ##6 | 27%  2025-05-07T20:26:11.2123638Z 2025-05-07T20:26:11.2123642Z 2025-05-07T20:26:11.2123664Z 2025-05-07T20:26:11.2123668Z 2025-05-07T20:26:11.2125537Z 2025-05-07T20:26:11.2247621Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 33%  2025-05-07T20:26:11.2251047Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:11.2251418Z 2025-05-07T20:26:11.2251424Z 2025-05-07T20:26:11.2251429Z 2025-05-07T20:26:11.2251434Z 2025-05-07T20:26:11.2251439Z 2025-05-07T20:26:11.2251445Z 2025-05-07T20:26:11.2293302Z cuda-nsight-12.8.55 | 113.2 MB | ##7 | 28%  2025-05-07T20:26:11.2293605Z 2025-05-07T20:26:11.2293609Z 2025-05-07T20:26:11.2293612Z 2025-05-07T20:26:11.2293616Z 2025-05-07T20:26:11.2293619Z 2025-05-07T20:26:11.2293623Z 2025-05-07T20:26:11.2295407Z 2025-05-07T20:26:11.3129373Z cuda-nvvp-12.8.57 | 112.4 MB | ##8 | 29%  2025-05-07T20:26:11.3129671Z 2025-05-07T20:26:11.3129675Z 2025-05-07T20:26:11.3129679Z 2025-05-07T20:26:11.3129682Z 2025-05-07T20:26:11.3129686Z 2025-05-07T20:26:11.3251839Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:11.3345517Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:11.3345899Z 2025-05-07T20:26:11.3345905Z 2025-05-07T20:26:11.3345910Z 2025-05-07T20:26:11.3345915Z 2025-05-07T20:26:11.3345921Z 2025-05-07T20:26:11.3345926Z 2025-05-07T20:26:11.3413803Z cuda-nsight-12.8.55 | 113.2 MB | ##9 | 30%  2025-05-07T20:26:11.3414105Z 2025-05-07T20:26:11.3414109Z 2025-05-07T20:26:11.3414113Z 2025-05-07T20:26:11.3414116Z 2025-05-07T20:26:11.3414120Z 2025-05-07T20:26:11.3414123Z 2025-05-07T20:26:11.3420323Z 2025-05-07T20:26:11.4182279Z cuda-nvvp-12.8.57 | 112.4 MB | ###1 | 31%  2025-05-07T20:26:11.4182582Z 2025-05-07T20:26:11.4182586Z 2025-05-07T20:26:11.4182589Z 2025-05-07T20:26:11.4182593Z 2025-05-07T20:26:11.4186441Z 2025-05-07T20:26:11.4311067Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:11.4347042Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 78% 2025-05-07T20:26:11.4347307Z 2025-05-07T20:26:11.4347311Z 2025-05-07T20:26:11.4347314Z 2025-05-07T20:26:11.4347318Z 2025-05-07T20:26:11.4347464Z 2025-05-07T20:26:11.4347468Z 2025-05-07T20:26:11.4455815Z cuda-nsight-12.8.55 | 113.2 MB | ###2 | 32%  2025-05-07T20:26:11.4456116Z 2025-05-07T20:26:11.4456120Z 2025-05-07T20:26:11.4456123Z 2025-05-07T20:26:11.4456127Z 2025-05-07T20:26:11.4456131Z 2025-05-07T20:26:11.4456135Z 2025-05-07T20:26:11.4460027Z 2025-05-07T20:26:11.5225797Z cuda-nvvp-12.8.57 | 112.4 MB | ###3 | 33%  2025-05-07T20:26:11.5226107Z 2025-05-07T20:26:11.5226111Z 2025-05-07T20:26:11.5226114Z 2025-05-07T20:26:11.5226118Z 2025-05-07T20:26:11.5226123Z 2025-05-07T20:26:11.5347827Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:11.5348102Z 2025-05-07T20:26:11.5348105Z 2025-05-07T20:26:11.5348109Z 2025-05-07T20:26:11.5348132Z 
2025-05-07T20:26:11.5348136Z 2025-05-07T20:26:11.5348143Z 2025-05-07T20:26:11.5365823Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 35%  2025-05-07T20:26:11.5459481Z libcublas-12.8.3.14 | 460.2 MB | #######8 | 79% 2025-05-07T20:26:11.5459854Z 2025-05-07T20:26:11.5459860Z 2025-05-07T20:26:11.5459865Z 2025-05-07T20:26:11.5459871Z 2025-05-07T20:26:11.5459874Z 2025-05-07T20:26:11.5459878Z 2025-05-07T20:26:11.5462376Z 2025-05-07T20:26:11.6228087Z cuda-nvvp-12.8.57 | 112.4 MB | ###5 | 36%  2025-05-07T20:26:11.6228382Z 2025-05-07T20:26:11.6228386Z 2025-05-07T20:26:11.6228389Z 2025-05-07T20:26:11.6228393Z 2025-05-07T20:26:11.6229818Z 2025-05-07T20:26:11.6369255Z libnpp-12.3.3.65 | 130.6 MB | #### | 40%  2025-05-07T20:26:11.6396990Z libcublas-12.8.3.14 | 460.2 MB | #######9 | 79% 2025-05-07T20:26:11.6397278Z 2025-05-07T20:26:11.6397282Z 2025-05-07T20:26:11.6397286Z 2025-05-07T20:26:11.6397290Z 2025-05-07T20:26:11.6397317Z 2025-05-07T20:26:11.6399498Z 2025-05-07T20:26:11.7228642Z cuda-nsight-12.8.55 | 113.2 MB | ###6 | 37%  2025-05-07T20:26:11.7229028Z 2025-05-07T20:26:11.7229049Z 2025-05-07T20:26:11.7229053Z 2025-05-07T20:26:11.7229057Z 2025-05-07T20:26:11.7229064Z 2025-05-07T20:26:11.7373763Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 42%  2025-05-07T20:26:11.7411739Z libcublas-12.8.3.14 | 460.2 MB | ######## | 80% 2025-05-07T20:26:11.7412101Z 2025-05-07T20:26:11.7412107Z 2025-05-07T20:26:11.7412111Z 2025-05-07T20:26:11.7412115Z 2025-05-07T20:26:11.7412118Z 2025-05-07T20:26:11.7412687Z 2025-05-07T20:26:11.7425350Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 39%  2025-05-07T20:26:11.7425859Z 2025-05-07T20:26:11.7425865Z 2025-05-07T20:26:11.7425871Z 2025-05-07T20:26:11.7425876Z 2025-05-07T20:26:11.7425881Z 2025-05-07T20:26:11.7425886Z 2025-05-07T20:26:11.7429313Z 2025-05-07T20:26:11.8265193Z cuda-nvvp-12.8.57 | 112.4 MB | ###7 | 38%  2025-05-07T20:26:11.8265609Z 2025-05-07T20:26:11.8265614Z 2025-05-07T20:26:11.8265619Z 2025-05-07T20:26:11.8265625Z 2025-05-07T20:26:11.8269247Z 2025-05-07T20:26:11.8398302Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 44%  2025-05-07T20:26:11.8427255Z libcublas-12.8.3.14 | 460.2 MB | ######## | 81% 2025-05-07T20:26:11.8427628Z 2025-05-07T20:26:11.8427791Z 2025-05-07T20:26:11.8427799Z 2025-05-07T20:26:11.8427804Z 2025-05-07T20:26:11.8427809Z 2025-05-07T20:26:11.8427815Z 2025-05-07T20:26:11.8428936Z 2025-05-07T20:26:11.8565067Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 40%  2025-05-07T20:26:11.8565475Z 2025-05-07T20:26:11.8565480Z 2025-05-07T20:26:11.8565486Z 2025-05-07T20:26:11.8565502Z 2025-05-07T20:26:11.8565508Z 2025-05-07T20:26:11.8565513Z 2025-05-07T20:26:11.9267025Z cuda-nsight-12.8.55 | 113.2 MB | ####1 | 41%  2025-05-07T20:26:11.9267446Z 2025-05-07T20:26:11.9267803Z 2025-05-07T20:26:11.9267810Z 2025-05-07T20:26:11.9267814Z 2025-05-07T20:26:11.9269353Z 2025-05-07T20:26:11.9428363Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 46%  2025-05-07T20:26:11.9428942Z 2025-05-07T20:26:11.9428946Z 2025-05-07T20:26:11.9428949Z 2025-05-07T20:26:11.9428952Z 2025-05-07T20:26:11.9428956Z 2025-05-07T20:26:11.9428959Z 2025-05-07T20:26:11.9432760Z 2025-05-07T20:26:11.9446323Z cuda-nvvp-12.8.57 | 112.4 MB | ####2 | 42%  2025-05-07T20:26:11.9568245Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 81% 2025-05-07T20:26:11.9568516Z 2025-05-07T20:26:11.9568521Z 2025-05-07T20:26:11.9568527Z 2025-05-07T20:26:11.9568540Z 2025-05-07T20:26:11.9568546Z 2025-05-07T20:26:11.9570815Z 2025-05-07T20:26:12.0271458Z cuda-nsight-12.8.55 | 113.2 MB | ####3 | 44%  2025-05-07T20:26:12.0271780Z 2025-05-07T20:26:12.0271784Z 
2025-05-07T20:26:12.0271796Z 2025-05-07T20:26:12.0271800Z 2025-05-07T20:26:12.0273558Z 2025-05-07T20:26:12.0449185Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 48%  2025-05-07T20:26:12.0476273Z libcublas-12.8.3.14 | 460.2 MB | ########1 | 82% 2025-05-07T20:26:12.0476579Z 2025-05-07T20:26:12.0476584Z 2025-05-07T20:26:12.0476588Z 2025-05-07T20:26:12.0476592Z 2025-05-07T20:26:12.0476596Z 2025-05-07T20:26:12.0476599Z 2025-05-07T20:26:12.0478086Z 2025-05-07T20:26:12.0743016Z cuda-nvvp-12.8.57 | 112.4 MB | ####4 | 44%  2025-05-07T20:26:12.0743391Z 2025-05-07T20:26:12.0743395Z 2025-05-07T20:26:12.0743399Z 2025-05-07T20:26:12.0743403Z 2025-05-07T20:26:12.0743414Z 2025-05-07T20:26:12.0747105Z 2025-05-07T20:26:12.1304660Z cuda-nsight-12.8.55 | 113.2 MB | ####5 | 46%  2025-05-07T20:26:12.1305015Z 2025-05-07T20:26:12.1305019Z 2025-05-07T20:26:12.1305023Z 2025-05-07T20:26:12.1305027Z 2025-05-07T20:26:12.1309735Z 2025-05-07T20:26:12.1450436Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:12.1478694Z libcublas-12.8.3.14 | 460.2 MB | ########2 | 82% 2025-05-07T20:26:12.1478959Z 2025-05-07T20:26:12.1478963Z 2025-05-07T20:26:12.1478967Z 2025-05-07T20:26:12.1478983Z 2025-05-07T20:26:12.1478987Z 2025-05-07T20:26:12.1478990Z 2025-05-07T20:26:12.1479061Z 2025-05-07T20:26:12.1786704Z cuda-nvvp-12.8.57 | 112.4 MB | ####6 | 47%  2025-05-07T20:26:12.1787015Z 2025-05-07T20:26:12.1787019Z 2025-05-07T20:26:12.1787023Z 2025-05-07T20:26:12.1787027Z 2025-05-07T20:26:12.1787031Z 2025-05-07T20:26:12.1795984Z 2025-05-07T20:26:12.2305750Z cuda-nsight-12.8.55 | 113.2 MB | ####7 | 48%  2025-05-07T20:26:12.2306165Z 2025-05-07T20:26:12.2306169Z 2025-05-07T20:26:12.2306172Z 2025-05-07T20:26:12.2306176Z 2025-05-07T20:26:12.2307884Z 2025-05-07T20:26:12.2456696Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 52%  2025-05-07T20:26:12.2480010Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 83% 2025-05-07T20:26:12.2480490Z 2025-05-07T20:26:12.2480494Z 2025-05-07T20:26:12.2480498Z 2025-05-07T20:26:12.2480502Z 2025-05-07T20:26:12.2480505Z 2025-05-07T20:26:12.2480509Z 2025-05-07T20:26:12.2480850Z 2025-05-07T20:26:12.2816241Z cuda-nvvp-12.8.57 | 112.4 MB | ####8 | 49%  2025-05-07T20:26:12.2816698Z 2025-05-07T20:26:12.2816705Z 2025-05-07T20:26:12.2816711Z 2025-05-07T20:26:12.2816717Z 2025-05-07T20:26:12.2816722Z 2025-05-07T20:26:12.2818674Z 2025-05-07T20:26:12.3475931Z cuda-nsight-12.8.55 | 113.2 MB | ##### | 50%  2025-05-07T20:26:12.3476295Z 2025-05-07T20:26:12.3476299Z 2025-05-07T20:26:12.3476303Z 2025-05-07T20:26:12.3476306Z 2025-05-07T20:26:12.3478551Z 2025-05-07T20:26:12.3481576Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 54%  2025-05-07T20:26:12.3481885Z 2025-05-07T20:26:12.3481889Z 2025-05-07T20:26:12.3481893Z 2025-05-07T20:26:12.3481897Z 2025-05-07T20:26:12.3481901Z 2025-05-07T20:26:12.3481905Z 2025-05-07T20:26:12.3483579Z 2025-05-07T20:26:12.3488779Z cuda-nvvp-12.8.57 | 112.4 MB | #####1 | 51%  2025-05-07T20:26:12.3832611Z libcublas-12.8.3.14 | 460.2 MB | ########3 | 84% 2025-05-07T20:26:12.3833308Z 2025-05-07T20:26:12.3833313Z 2025-05-07T20:26:12.3833316Z 2025-05-07T20:26:12.3833320Z 2025-05-07T20:26:12.3833324Z 2025-05-07T20:26:12.3836138Z 2025-05-07T20:26:12.4477578Z cuda-nsight-12.8.55 | 113.2 MB | #####2 | 52%  2025-05-07T20:26:12.4477965Z 2025-05-07T20:26:12.4477969Z 2025-05-07T20:26:12.4477973Z 2025-05-07T20:26:12.4477976Z 2025-05-07T20:26:12.4477980Z 2025-05-07T20:26:12.4483749Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 56%  2025-05-07T20:26:12.4484070Z 2025-05-07T20:26:12.4484074Z 2025-05-07T20:26:12.4484078Z 
2025-05-07T20:26:12.4484081Z 2025-05-07T20:26:12.4484085Z 2025-05-07T20:26:12.4484089Z 2025-05-07T20:26:12.4486058Z 2025-05-07T20:26:12.4553168Z cuda-nvvp-12.8.57 | 112.4 MB | #####3 | 53%  2025-05-07T20:26:12.4836190Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 84% 2025-05-07T20:26:12.4836574Z 2025-05-07T20:26:12.4836679Z 2025-05-07T20:26:12.4836730Z 2025-05-07T20:26:12.4836735Z 2025-05-07T20:26:12.4836741Z 2025-05-07T20:26:12.4836750Z 2025-05-07T20:26:12.5502753Z cuda-nsight-12.8.55 | 113.2 MB | #####4 | 54%  2025-05-07T20:26:12.5503085Z 2025-05-07T20:26:12.5503090Z 2025-05-07T20:26:12.5503093Z 2025-05-07T20:26:12.5503097Z 2025-05-07T20:26:12.5510005Z 2025-05-07T20:26:12.5571818Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 58%  2025-05-07T20:26:12.5575947Z libcublas-12.8.3.14 | 460.2 MB | ########4 | 85% 2025-05-07T20:26:12.5576205Z 2025-05-07T20:26:12.5576209Z 2025-05-07T20:26:12.5576213Z 2025-05-07T20:26:12.5576216Z 2025-05-07T20:26:12.5576220Z 2025-05-07T20:26:12.5576232Z 2025-05-07T20:26:12.5577909Z 2025-05-07T20:26:12.5837520Z cuda-nvvp-12.8.57 | 112.4 MB | #####5 | 56%  2025-05-07T20:26:12.5837953Z 2025-05-07T20:26:12.5837959Z 2025-05-07T20:26:12.5837965Z 2025-05-07T20:26:12.5837970Z 2025-05-07T20:26:12.5837974Z 2025-05-07T20:26:12.5839114Z 2025-05-07T20:26:12.6510022Z cuda-nsight-12.8.55 | 113.2 MB | #####6 | 57%  2025-05-07T20:26:12.6510433Z 2025-05-07T20:26:12.6510438Z 2025-05-07T20:26:12.6510441Z 2025-05-07T20:26:12.6510445Z 2025-05-07T20:26:12.6512349Z 2025-05-07T20:26:12.6573051Z libnpp-12.3.3.65 | 130.6 MB | ###### | 60%  2025-05-07T20:26:12.6576107Z libcublas-12.8.3.14 | 460.2 MB | ########5 | 85% 2025-05-07T20:26:12.6576441Z 2025-05-07T20:26:12.6576446Z 2025-05-07T20:26:12.6576449Z 2025-05-07T20:26:12.6576453Z 2025-05-07T20:26:12.6576457Z 2025-05-07T20:26:12.6576460Z 2025-05-07T20:26:12.6576464Z 2025-05-07T20:26:12.6919154Z cuda-nvvp-12.8.57 | 112.4 MB | #####7 | 58%  2025-05-07T20:26:12.6919576Z 2025-05-07T20:26:12.6919754Z 2025-05-07T20:26:12.6919786Z 2025-05-07T20:26:12.6919792Z 2025-05-07T20:26:12.6919797Z 2025-05-07T20:26:12.6924381Z 2025-05-07T20:26:12.7514581Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 59%  2025-05-07T20:26:12.7514957Z 2025-05-07T20:26:12.7514961Z 2025-05-07T20:26:12.7514965Z 2025-05-07T20:26:12.7514969Z 2025-05-07T20:26:12.7516434Z 2025-05-07T20:26:12.7582123Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 62%  2025-05-07T20:26:12.7582531Z 2025-05-07T20:26:12.7582535Z 2025-05-07T20:26:12.7582539Z 2025-05-07T20:26:12.7582542Z 2025-05-07T20:26:12.7582546Z 2025-05-07T20:26:12.7582549Z 2025-05-07T20:26:12.7582553Z 2025-05-07T20:26:12.7620857Z cuda-nvvp-12.8.57 | 112.4 MB | ###### | 60%  2025-05-07T20:26:12.7947852Z libcublas-12.8.3.14 | 460.2 MB | ########6 | 86% 2025-05-07T20:26:12.7948222Z 2025-05-07T20:26:12.7948228Z 2025-05-07T20:26:12.7948244Z 2025-05-07T20:26:12.7948249Z 2025-05-07T20:26:12.7948255Z 2025-05-07T20:26:12.7949767Z 2025-05-07T20:26:12.8516184Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:26:12.8516524Z 2025-05-07T20:26:12.8516528Z 2025-05-07T20:26:12.8516531Z 2025-05-07T20:26:12.8516717Z 2025-05-07T20:26:12.8523373Z 2025-05-07T20:26:12.8593167Z libnpp-12.3.3.65 | 130.6 MB | ######4 | 64%  2025-05-07T20:26:12.8593557Z 2025-05-07T20:26:12.8593561Z 2025-05-07T20:26:12.8593565Z 2025-05-07T20:26:12.8593568Z 2025-05-07T20:26:12.8593572Z 2025-05-07T20:26:12.8593576Z 2025-05-07T20:26:12.8595700Z 2025-05-07T20:26:12.8629141Z cuda-nvvp-12.8.57 | 112.4 MB | ######2 | 63%  2025-05-07T20:26:12.8948196Z libcublas-12.8.3.14 | 460.2 MB | 
########6 | 87% 2025-05-07T20:26:12.8948588Z 2025-05-07T20:26:12.8948788Z 2025-05-07T20:26:12.8948795Z 2025-05-07T20:26:12.8948801Z 2025-05-07T20:26:12.8948806Z 2025-05-07T20:26:12.8950404Z 2025-05-07T20:26:12.9521120Z cuda-nsight-12.8.55 | 113.2 MB | ######3 | 63%  2025-05-07T20:26:12.9521460Z 2025-05-07T20:26:12.9521465Z 2025-05-07T20:26:12.9521469Z 2025-05-07T20:26:12.9521474Z 2025-05-07T20:26:12.9521486Z 2025-05-07T20:26:12.9599296Z libnpp-12.3.3.65 | 130.6 MB | ######6 | 66%  2025-05-07T20:26:12.9599599Z 2025-05-07T20:26:12.9599603Z 2025-05-07T20:26:12.9599607Z 2025-05-07T20:26:12.9599610Z 2025-05-07T20:26:12.9599621Z 2025-05-07T20:26:12.9599624Z 2025-05-07T20:26:12.9603369Z 2025-05-07T20:26:12.9632004Z cuda-nvvp-12.8.57 | 112.4 MB | ######4 | 65%  2025-05-07T20:26:12.9953643Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 87% 2025-05-07T20:26:12.9953909Z 2025-05-07T20:26:12.9953913Z 2025-05-07T20:26:12.9953916Z 2025-05-07T20:26:12.9953920Z 2025-05-07T20:26:12.9953924Z 2025-05-07T20:26:12.9956042Z 2025-05-07T20:26:13.0602377Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 66%  2025-05-07T20:26:13.0602695Z 2025-05-07T20:26:13.0602700Z 2025-05-07T20:26:13.0602703Z 2025-05-07T20:26:13.0602736Z 2025-05-07T20:26:13.0602740Z 2025-05-07T20:26:13.0602743Z 2025-05-07T20:26:13.0603445Z 2025-05-07T20:26:13.0627313Z cuda-nvvp-12.8.57 | 112.4 MB | ######7 | 67%  2025-05-07T20:26:13.0627625Z 2025-05-07T20:26:13.0627629Z 2025-05-07T20:26:13.0627633Z 2025-05-07T20:26:13.0627644Z 2025-05-07T20:26:13.0627648Z 2025-05-07T20:26:13.0650542Z libnpp-12.3.3.65 | 130.6 MB | ######8 | 68%  2025-05-07T20:26:13.1049241Z libcublas-12.8.3.14 | 460.2 MB | ########7 | 88% 2025-05-07T20:26:13.1049574Z 2025-05-07T20:26:13.1049637Z 2025-05-07T20:26:13.1049643Z 2025-05-07T20:26:13.1049648Z 2025-05-07T20:26:13.1049654Z 2025-05-07T20:26:13.1056919Z 2025-05-07T20:26:13.1603345Z cuda-nsight-12.8.55 | 113.2 MB | ######8 | 68%  2025-05-07T20:26:13.1603862Z 2025-05-07T20:26:13.1603871Z 2025-05-07T20:26:13.1603877Z 2025-05-07T20:26:13.1603884Z 2025-05-07T20:26:13.1603894Z 2025-05-07T20:26:13.1603902Z 2025-05-07T20:26:13.1604522Z 2025-05-07T20:26:13.1630115Z cuda-nvvp-12.8.57 | 112.4 MB | ######9 | 70%  2025-05-07T20:26:13.1630423Z 2025-05-07T20:26:13.1630427Z 2025-05-07T20:26:13.1630446Z 2025-05-07T20:26:13.1630449Z 2025-05-07T20:26:13.1632597Z 2025-05-07T20:26:13.1652138Z libnpp-12.3.3.65 | 130.6 MB | ####### | 70%  2025-05-07T20:26:13.2057034Z libcublas-12.8.3.14 | 460.2 MB | ########8 | 89% 2025-05-07T20:26:13.2057419Z 2025-05-07T20:26:13.2057425Z 2025-05-07T20:26:13.2057430Z 2025-05-07T20:26:13.2057435Z 2025-05-07T20:26:13.2057440Z 2025-05-07T20:26:13.2059982Z 2025-05-07T20:26:13.2656003Z cuda-nsight-12.8.55 | 113.2 MB | ####### | 70%  2025-05-07T20:26:13.2656332Z 2025-05-07T20:26:13.2656336Z 2025-05-07T20:26:13.2656340Z 2025-05-07T20:26:13.2656343Z 2025-05-07T20:26:13.2656348Z 2025-05-07T20:26:13.2656352Z 2025-05-07T20:26:13.2656359Z 2025-05-07T20:26:13.2686901Z cuda-nvvp-12.8.57 | 112.4 MB | #######1 | 72%  2025-05-07T20:26:13.2687213Z 2025-05-07T20:26:13.2687217Z 2025-05-07T20:26:13.2687221Z 2025-05-07T20:26:13.2687225Z 2025-05-07T20:26:13.2688515Z 2025-05-07T20:26:13.2694772Z libnpp-12.3.3.65 | 130.6 MB | #######2 | 72%  2025-05-07T20:26:13.3067432Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 89% 2025-05-07T20:26:13.3067763Z 2025-05-07T20:26:13.3067778Z 2025-05-07T20:26:13.3067784Z 2025-05-07T20:26:13.3067789Z 2025-05-07T20:26:13.3067794Z 2025-05-07T20:26:13.3069345Z 2025-05-07T20:26:13.3659030Z cuda-nsight-12.8.55 | 113.2 
MB | #######2 | 73%  2025-05-07T20:26:13.3659368Z 2025-05-07T20:26:13.3659372Z 2025-05-07T20:26:13.3659376Z 2025-05-07T20:26:13.3659380Z 2025-05-07T20:26:13.3659384Z 2025-05-07T20:26:13.3659388Z 2025-05-07T20:26:13.3659392Z 2025-05-07T20:26:13.3686814Z cuda-nvvp-12.8.57 | 112.4 MB | #######4 | 74%  2025-05-07T20:26:13.3687133Z 2025-05-07T20:26:13.3687139Z 2025-05-07T20:26:13.3687172Z 2025-05-07T20:26:13.3687178Z 2025-05-07T20:26:13.3689302Z 2025-05-07T20:26:13.3697817Z libnpp-12.3.3.65 | 130.6 MB | #######4 | 74%  2025-05-07T20:26:13.4070921Z libcublas-12.8.3.14 | 460.2 MB | ########9 | 90% 2025-05-07T20:26:13.4071195Z 2025-05-07T20:26:13.4071199Z 2025-05-07T20:26:13.4071212Z 2025-05-07T20:26:13.4071216Z 2025-05-07T20:26:13.4071219Z 2025-05-07T20:26:13.4074384Z 2025-05-07T20:26:13.4665538Z cuda-nsight-12.8.55 | 113.2 MB | #######4 | 75%  2025-05-07T20:26:13.4665874Z 2025-05-07T20:26:13.4665878Z 2025-05-07T20:26:13.4665889Z 2025-05-07T20:26:13.4665893Z 2025-05-07T20:26:13.4665897Z 2025-05-07T20:26:13.4665901Z 2025-05-07T20:26:13.4665905Z 2025-05-07T20:26:13.4731524Z cuda-nvvp-12.8.57 | 112.4 MB | #######6 | 77%  2025-05-07T20:26:13.4801142Z libcublas-12.8.3.14 | 460.2 MB | ######### | 90% 2025-05-07T20:26:13.4801416Z 2025-05-07T20:26:13.4801420Z 2025-05-07T20:26:13.4801425Z 2025-05-07T20:26:13.4801454Z 2025-05-07T20:26:13.4806375Z 2025-05-07T20:26:13.5106447Z libnpp-12.3.3.65 | 130.6 MB | #######6 | 76%  2025-05-07T20:26:13.5106756Z 2025-05-07T20:26:13.5106784Z 2025-05-07T20:26:13.5106788Z 2025-05-07T20:26:13.5106791Z 2025-05-07T20:26:13.5106795Z 2025-05-07T20:26:13.5107560Z 2025-05-07T20:26:13.5670019Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 77%  2025-05-07T20:26:13.5670616Z 2025-05-07T20:26:13.5670620Z 2025-05-07T20:26:13.5670625Z 2025-05-07T20:26:13.5670629Z 2025-05-07T20:26:13.5670634Z 2025-05-07T20:26:13.5670638Z 2025-05-07T20:26:13.5670643Z 2025-05-07T20:26:13.5737940Z cuda-nvvp-12.8.57 | 112.4 MB | #######9 | 79%  2025-05-07T20:26:13.5898301Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 91% 2025-05-07T20:26:13.5898667Z 2025-05-07T20:26:13.5898680Z 2025-05-07T20:26:13.5898684Z 2025-05-07T20:26:13.5898688Z 2025-05-07T20:26:13.5910216Z 2025-05-07T20:26:13.6115875Z libnpp-12.3.3.65 | 130.6 MB | #######8 | 78%  2025-05-07T20:26:13.6116257Z 2025-05-07T20:26:13.6116263Z 2025-05-07T20:26:13.6116269Z 2025-05-07T20:26:13.6116274Z 2025-05-07T20:26:13.6116292Z 2025-05-07T20:26:13.6116298Z 2025-05-07T20:26:13.6693331Z cuda-nsight-12.8.55 | 113.2 MB | #######9 | 79%  2025-05-07T20:26:13.6693666Z 2025-05-07T20:26:13.6693670Z 2025-05-07T20:26:13.6693673Z 2025-05-07T20:26:13.6693677Z 2025-05-07T20:26:13.6693681Z 2025-05-07T20:26:13.6693685Z 2025-05-07T20:26:13.6698071Z 2025-05-07T20:26:13.6911922Z cuda-nvvp-12.8.57 | 112.4 MB | ########1 | 82%  2025-05-07T20:26:13.6953949Z libcublas-12.8.3.14 | 460.2 MB | #########1 | 92% 2025-05-07T20:26:13.6954217Z 2025-05-07T20:26:13.6954221Z 2025-05-07T20:26:13.6954225Z 2025-05-07T20:26:13.6954228Z 2025-05-07T20:26:13.6954260Z 2025-05-07T20:26:13.7120308Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:26:13.7120614Z 2025-05-07T20:26:13.7120864Z 2025-05-07T20:26:13.7120869Z 2025-05-07T20:26:13.7120873Z 2025-05-07T20:26:13.7120885Z 2025-05-07T20:26:13.7129221Z 2025-05-07T20:26:13.7719158Z cuda-nsight-12.8.55 | 113.2 MB | ########1 | 82%  2025-05-07T20:26:13.7719750Z 2025-05-07T20:26:13.7719756Z 2025-05-07T20:26:13.7719771Z 2025-05-07T20:26:13.7719775Z 2025-05-07T20:26:13.7719779Z 2025-05-07T20:26:13.7719782Z 2025-05-07T20:26:13.7719786Z 
2025-05-07T20:26:17.9631202Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:26:18.0112938Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:18.1598253Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:19.1630981Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:21.2924211Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:21.6349029Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:21.6451030Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:21.6807856Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:21.7811423Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:26:22.4968872Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:23.4758503Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:23.4992805Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:23.7985284Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:24.1203502Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:24.4070069Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:24.4203880Z ... (more hidden) ...
2025-05-07T20:26:24.4677946Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:24.5072266Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:25.1247198Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:26.6491463Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:35.7878653Z 2025-05-07T20:26:35.7878926Z  2025-05-07T20:26:35.7879290Z 2025-05-07T20:26:35.7879296Z 2025-05-07T20:26:35.7879301Z 2025-05-07T20:26:35.7879306Z 2025-05-07T20:26:35.7879312Z 2025-05-07T20:26:35.7879317Z 2025-05-07T20:26:35.7879322Z 2025-05-07T20:26:35.7879328Z 2025-05-07T20:26:35.7879341Z 2025-05-07T20:26:35.7879346Z 2025-05-07T20:26:35.7879352Z 2025-05-07T20:26:35.7879357Z 2025-05-07T20:26:35.7879363Z 2025-05-07T20:26:35.7879369Z 2025-05-07T20:26:35.7879374Z 2025-05-07T20:26:35.7879380Z 2025-05-07T20:26:35.7879393Z 2025-05-07T20:26:35.7879724Z  2025-05-07T20:26:35.7880078Z 2025-05-07T20:26:35.7880084Z 2025-05-07T20:26:35.7880097Z 2025-05-07T20:26:35.7880102Z 2025-05-07T20:26:35.7880107Z 2025-05-07T20:26:35.7880112Z 2025-05-07T20:26:35.7880250Z 2025-05-07T20:26:35.7880256Z 2025-05-07T20:26:35.7880261Z 2025-05-07T20:26:35.7880278Z 2025-05-07T20:26:35.7880284Z 2025-05-07T20:26:35.7880290Z 2025-05-07T20:26:35.7880296Z 2025-05-07T20:26:35.7880301Z 2025-05-07T20:26:35.7880307Z 2025-05-07T20:26:35.7880313Z 2025-05-07T20:26:35.7880318Z 2025-05-07T20:26:35.7880324Z 2025-05-07T20:26:35.7881518Z  2025-05-07T20:26:35.7881869Z 2025-05-07T20:26:35.7881881Z 2025-05-07T20:26:35.7882056Z  2025-05-07T20:26:35.7882231Z 2025-05-07T20:26:35.7882241Z 2025-05-07T20:26:35.7882929Z  2025-05-07T20:26:35.7883112Z 2025-05-07T20:26:35.7883118Z 2025-05-07T20:26:35.7883128Z 2025-05-07T20:26:35.7884334Z  2025-05-07T20:26:35.7884515Z 2025-05-07T20:26:35.7884539Z 2025-05-07T20:26:35.7884544Z 2025-05-07T20:26:35.7884549Z 2025-05-07T20:26:35.7884710Z  2025-05-07T20:26:35.7885016Z 2025-05-07T20:26:35.7885022Z 2025-05-07T20:26:35.7885026Z 2025-05-07T20:26:35.7885039Z 2025-05-07T20:26:35.7885044Z 2025-05-07T20:26:35.7885209Z  2025-05-07T20:26:35.7885384Z 2025-05-07T20:26:35.7885390Z 2025-05-07T20:26:35.7885395Z 2025-05-07T20:26:35.7885400Z 2025-05-07T20:26:35.7885405Z 2025-05-07T20:26:35.7885420Z 2025-05-07T20:26:35.7885588Z  2025-05-07T20:26:35.7885771Z 2025-05-07T20:26:35.7885776Z 2025-05-07T20:26:35.7885781Z 2025-05-07T20:26:35.7885787Z 2025-05-07T20:26:35.7885792Z 2025-05-07T20:26:35.7885804Z 2025-05-07T20:26:35.7885810Z 2025-05-07T20:26:35.7886226Z  2025-05-07T20:26:35.7886385Z 2025-05-07T20:26:35.7886389Z 2025-05-07T20:26:35.7886393Z 2025-05-07T20:26:35.7886396Z 2025-05-07T20:26:35.7886413Z 2025-05-07T20:26:35.7886429Z 2025-05-07T20:26:35.7886432Z 2025-05-07T20:26:35.7886439Z 2025-05-07T20:26:35.7886764Z  2025-05-07T20:26:35.7886975Z 2025-05-07T20:26:35.7886986Z 2025-05-07T20:26:35.7886997Z 2025-05-07T20:26:35.7887005Z 2025-05-07T20:26:35.7887009Z 2025-05-07T20:26:35.7887013Z 2025-05-07T20:26:35.7887016Z 2025-05-07T20:26:35.7887020Z 2025-05-07T20:26:35.7887023Z 2025-05-07T20:26:35.7887552Z  2025-05-07T20:26:35.7887827Z 2025-05-07T20:26:35.7887832Z 2025-05-07T20:26:35.7887838Z 2025-05-07T20:26:35.7887843Z 2025-05-07T20:26:35.7887848Z 2025-05-07T20:26:35.7887853Z 2025-05-07T20:26:35.7887858Z 2025-05-07T20:26:35.7887863Z 2025-05-07T20:26:35.7887869Z 2025-05-07T20:26:35.7887888Z 2025-05-07T20:26:35.7888340Z  2025-05-07T20:26:35.7888593Z 2025-05-07T20:26:35.7888599Z 2025-05-07T20:26:35.7888605Z 2025-05-07T20:26:35.7888610Z 2025-05-07T20:26:35.7888615Z 2025-05-07T20:26:35.7888628Z 2025-05-07T20:26:35.7888644Z 2025-05-07T20:26:35.7888649Z 2025-05-07T20:26:35.7888654Z 2025-05-07T20:26:35.7888660Z 2025-05-07T20:26:35.7888665Z 2025-05-07T20:26:35.7888926Z  2025-05-07T20:26:35.7889196Z 2025-05-07T20:26:35.7889201Z 2025-05-07T20:26:35.7889207Z 2025-05-07T20:26:35.7889212Z 2025-05-07T20:26:35.7889234Z 
2025-05-07T20:26:35.7889239Z 2025-05-07T20:26:35.7889245Z 2025-05-07T20:26:35.7889250Z 2025-05-07T20:26:35.7889255Z 2025-05-07T20:26:35.7889260Z 2025-05-07T20:26:35.7889265Z 2025-05-07T20:26:35.7889271Z 2025-05-07T20:26:35.7889511Z  2025-05-07T20:26:35.7889786Z 2025-05-07T20:26:35.7889799Z 2025-05-07T20:26:35.7889805Z 2025-05-07T20:26:35.7889810Z 2025-05-07T20:26:35.7889815Z 2025-05-07T20:26:35.7889826Z 2025-05-07T20:26:35.7889831Z 2025-05-07T20:26:35.7889837Z 2025-05-07T20:26:35.7889842Z 2025-05-07T20:26:35.7889847Z 2025-05-07T20:26:35.7889852Z 2025-05-07T20:26:35.7889857Z 2025-05-07T20:26:35.7889862Z 2025-05-07T20:26:35.7890270Z  2025-05-07T20:26:35.7890586Z 2025-05-07T20:26:35.7890593Z 2025-05-07T20:26:35.7890599Z 2025-05-07T20:26:35.7890604Z 2025-05-07T20:26:35.7890629Z 2025-05-07T20:26:35.7890645Z 2025-05-07T20:26:35.7890651Z 2025-05-07T20:26:35.7890656Z 2025-05-07T20:26:35.7890662Z 2025-05-07T20:26:35.7890667Z 2025-05-07T20:26:35.7890673Z 2025-05-07T20:26:35.7890678Z 2025-05-07T20:26:35.7890684Z 2025-05-07T20:26:35.7890690Z 2025-05-07T20:26:35.7890943Z  2025-05-07T20:26:35.7891270Z 2025-05-07T20:26:35.7891276Z 2025-05-07T20:26:35.7891282Z 2025-05-07T20:26:35.7891288Z 2025-05-07T20:26:35.7891294Z 2025-05-07T20:26:35.7891299Z 2025-05-07T20:26:35.7891313Z 2025-05-07T20:26:35.7891319Z 2025-05-07T20:26:35.7891325Z 2025-05-07T20:26:35.7891331Z 2025-05-07T20:26:35.7891337Z 2025-05-07T20:26:35.7891342Z 2025-05-07T20:26:35.7891348Z 2025-05-07T20:26:35.7891353Z 2025-05-07T20:26:35.7891359Z 2025-05-07T20:26:35.7891781Z  2025-05-07T20:26:35.7892108Z 2025-05-07T20:26:35.7892114Z 2025-05-07T20:26:35.7892120Z 2025-05-07T20:26:35.7892127Z 2025-05-07T20:26:35.7892238Z 2025-05-07T20:26:35.7892244Z 2025-05-07T20:26:35.7892250Z 2025-05-07T20:26:35.7892256Z 2025-05-07T20:26:35.7892262Z 2025-05-07T20:26:35.7892278Z 2025-05-07T20:26:35.7892284Z 2025-05-07T20:26:35.7892290Z 2025-05-07T20:26:35.7892296Z 2025-05-07T20:26:35.7892302Z 2025-05-07T20:26:35.7892308Z 2025-05-07T20:26:35.7892326Z 2025-05-07T20:26:35.7892589Z  2025-05-07T20:26:35.7892936Z 2025-05-07T20:26:35.7892942Z 2025-05-07T20:26:35.7892948Z 2025-05-07T20:26:35.7892954Z 2025-05-07T20:26:35.7892960Z 2025-05-07T20:26:35.7892966Z 2025-05-07T20:26:35.7892972Z 2025-05-07T20:26:35.7892978Z 2025-05-07T20:26:35.7892984Z 2025-05-07T20:26:35.7892990Z 2025-05-07T20:26:35.7892995Z 2025-05-07T20:26:35.7893001Z 2025-05-07T20:26:35.7893007Z 2025-05-07T20:26:35.7893022Z 2025-05-07T20:26:35.7893028Z 2025-05-07T20:26:35.7893034Z 2025-05-07T20:26:35.7893040Z 2025-05-07T20:26:35.7893321Z  2025-05-07T20:26:35.7893689Z 2025-05-07T20:26:35.7893695Z 2025-05-07T20:26:35.7893701Z 2025-05-07T20:26:35.7893707Z 2025-05-07T20:26:35.7893713Z 2025-05-07T20:26:35.7893718Z 2025-05-07T20:26:35.7893724Z 2025-05-07T20:26:35.7893729Z 2025-05-07T20:26:35.7893735Z 2025-05-07T20:26:35.7893749Z 2025-05-07T20:26:35.7893756Z 2025-05-07T20:26:35.7893761Z 2025-05-07T20:26:35.7893767Z 2025-05-07T20:26:35.7893773Z 2025-05-07T20:26:35.7893778Z 2025-05-07T20:26:35.7893784Z 2025-05-07T20:26:35.7893790Z 2025-05-07T20:26:35.7893796Z 2025-05-07T20:26:35.7894499Z  2025-05-07T20:26:35.7894855Z 2025-05-07T20:26:35.7894866Z 2025-05-07T20:26:35.7895036Z  2025-05-07T20:26:35.7895205Z 2025-05-07T20:26:35.7895211Z 2025-05-07T20:26:35.7895624Z  2025-05-07T20:26:35.7895802Z 2025-05-07T20:26:35.7895816Z 2025-05-07T20:26:35.7895825Z 2025-05-07T20:26:35.7896205Z  2025-05-07T20:26:35.7896380Z 2025-05-07T20:26:35.7896386Z 2025-05-07T20:26:35.7896391Z 2025-05-07T20:26:35.7896412Z 2025-05-07T20:26:35.7896734Z  
2025-05-07T20:26:35.7896908Z 2025-05-07T20:26:35.7896921Z 2025-05-07T20:26:35.7896927Z 2025-05-07T20:26:35.7896932Z 2025-05-07T20:26:35.7896941Z 2025-05-07T20:26:35.7897396Z  2025-05-07T20:26:35.7897611Z 2025-05-07T20:26:35.7897617Z 2025-05-07T20:26:35.7897623Z 2025-05-07T20:26:35.7897629Z 2025-05-07T20:26:35.7897635Z 2025-05-07T20:26:35.7897640Z 2025-05-07T20:26:35.7898014Z  2025-05-07T20:26:35.7898231Z 2025-05-07T20:26:35.7898237Z 2025-05-07T20:26:35.7898243Z 2025-05-07T20:26:35.7898249Z 2025-05-07T20:26:35.7898255Z 2025-05-07T20:26:35.7898261Z 2025-05-07T20:26:35.7898271Z 2025-05-07T20:26:35.7898667Z  2025-05-07T20:26:35.7898906Z 2025-05-07T20:26:35.7898912Z 2025-05-07T20:26:35.7898927Z 2025-05-07T20:26:35.7898933Z 2025-05-07T20:26:35.7898939Z 2025-05-07T20:26:35.7898944Z 2025-05-07T20:26:35.7898964Z 2025-05-07T20:26:35.7898969Z 2025-05-07T20:26:35.7899355Z  2025-05-07T20:26:35.7899597Z 2025-05-07T20:26:35.7899602Z 2025-05-07T20:26:35.7899607Z 2025-05-07T20:26:35.7899612Z 2025-05-07T20:26:35.7899618Z 2025-05-07T20:26:35.7899623Z 2025-05-07T20:26:35.7899628Z 2025-05-07T20:26:35.7899633Z 2025-05-07T20:26:35.7899641Z 2025-05-07T20:26:35.7899908Z  2025-05-07T20:26:35.7900138Z 2025-05-07T20:26:35.7900143Z 2025-05-07T20:26:35.7900148Z 2025-05-07T20:26:35.7900153Z 2025-05-07T20:26:35.7900158Z 2025-05-07T20:26:35.7900169Z 2025-05-07T20:26:35.7900174Z 2025-05-07T20:26:35.7900179Z 2025-05-07T20:26:35.7900184Z 2025-05-07T20:26:35.7900189Z 2025-05-07T20:26:35.7900532Z  2025-05-07T20:26:35.7900778Z 2025-05-07T20:26:35.7900792Z 2025-05-07T20:26:35.7900797Z 2025-05-07T20:26:35.7900803Z 2025-05-07T20:26:35.7900939Z 2025-05-07T20:26:35.7900946Z 2025-05-07T20:26:35.7900952Z 2025-05-07T20:26:35.7900957Z 2025-05-07T20:26:35.7900962Z 2025-05-07T20:26:35.7900967Z 2025-05-07T20:26:35.7901066Z 2025-05-07T20:26:35.7901270Z  2025-05-07T20:26:35.7901535Z 2025-05-07T20:26:35.7901541Z 2025-05-07T20:26:35.7901546Z 2025-05-07T20:26:35.7901551Z 2025-05-07T20:26:35.7901557Z 2025-05-07T20:26:35.7901562Z 2025-05-07T20:26:35.7901567Z 2025-05-07T20:26:35.7901572Z 2025-05-07T20:26:35.7901576Z 2025-05-07T20:26:35.7901581Z 2025-05-07T20:26:35.7901586Z 2025-05-07T20:26:35.7901591Z 2025-05-07T20:26:35.7901794Z  2025-05-07T20:26:35.7902063Z 2025-05-07T20:26:35.7902069Z 2025-05-07T20:26:35.7902074Z 2025-05-07T20:26:35.7902079Z 2025-05-07T20:26:35.7902084Z 2025-05-07T20:26:35.7902089Z 2025-05-07T20:26:35.7902094Z 2025-05-07T20:26:35.7902100Z 2025-05-07T20:26:35.7902105Z 2025-05-07T20:26:35.7902111Z 2025-05-07T20:26:35.7902116Z 2025-05-07T20:26:35.7902129Z 2025-05-07T20:26:35.7902134Z 2025-05-07T20:26:35.7902343Z  2025-05-07T20:26:35.7902616Z 2025-05-07T20:26:35.7902621Z 2025-05-07T20:26:35.7902634Z 2025-05-07T20:26:35.7902639Z 2025-05-07T20:26:35.7902644Z 2025-05-07T20:26:35.7902650Z 2025-05-07T20:26:35.7902655Z 2025-05-07T20:26:35.7902669Z 2025-05-07T20:26:35.7902674Z 2025-05-07T20:26:35.7902678Z 2025-05-07T20:26:35.7902683Z 2025-05-07T20:26:35.7902688Z 2025-05-07T20:26:35.7902694Z 2025-05-07T20:26:35.7902699Z 2025-05-07T20:26:35.7902908Z  2025-05-07T20:26:35.7903204Z 2025-05-07T20:26:35.7903209Z 2025-05-07T20:26:35.7903214Z 2025-05-07T20:26:35.7903219Z 2025-05-07T20:26:35.7903224Z 2025-05-07T20:26:35.7903228Z 2025-05-07T20:26:35.7903233Z 2025-05-07T20:26:35.7903238Z 2025-05-07T20:26:35.7903243Z 2025-05-07T20:26:35.7903248Z 2025-05-07T20:26:35.7903254Z 2025-05-07T20:26:35.7903259Z 2025-05-07T20:26:35.7903264Z 2025-05-07T20:26:35.7903275Z 2025-05-07T20:26:35.7903280Z 2025-05-07T20:26:35.7903504Z  2025-05-07T20:26:35.7903794Z 
2025-05-07T20:26:35.7903799Z 2025-05-07T20:26:35.7903811Z 2025-05-07T20:26:35.7903816Z 2025-05-07T20:26:35.7903821Z 2025-05-07T20:26:35.7903826Z 2025-05-07T20:26:35.7903831Z 2025-05-07T20:26:35.7903836Z 2025-05-07T20:26:35.7903842Z 2025-05-07T20:26:35.7903847Z 2025-05-07T20:26:35.7903852Z 2025-05-07T20:26:35.7903857Z 2025-05-07T20:26:35.7903870Z 2025-05-07T20:26:35.7903875Z 2025-05-07T20:26:35.7903880Z 2025-05-07T20:26:35.7903885Z 2025-05-07T20:26:35.7904107Z  2025-05-07T20:26:35.7904409Z 2025-05-07T20:26:35.7904423Z 2025-05-07T20:26:35.7904428Z 2025-05-07T20:26:35.7904433Z 2025-05-07T20:26:35.7904438Z 2025-05-07T20:26:35.7904443Z 2025-05-07T20:26:35.7904448Z 2025-05-07T20:26:35.7904454Z 2025-05-07T20:26:35.7904459Z 2025-05-07T20:26:35.7904464Z 2025-05-07T20:26:35.7904469Z 2025-05-07T20:26:35.7904480Z 2025-05-07T20:26:35.7904485Z 2025-05-07T20:26:35.7904490Z 2025-05-07T20:26:35.7904495Z 2025-05-07T20:26:35.7904500Z 2025-05-07T20:26:35.7904505Z 2025-05-07T20:26:35.7904748Z  2025-05-07T20:26:35.7905049Z 2025-05-07T20:26:35.7905054Z 2025-05-07T20:26:35.7905059Z 2025-05-07T20:26:35.7905065Z 2025-05-07T20:26:35.7905069Z 2025-05-07T20:26:35.7905074Z 2025-05-07T20:26:35.7905080Z 2025-05-07T20:26:35.7905085Z 2025-05-07T20:26:35.7905090Z 2025-05-07T20:26:35.7905095Z 2025-05-07T20:26:35.7905100Z 2025-05-07T20:26:35.7905105Z 2025-05-07T20:26:35.7905118Z 2025-05-07T20:26:35.7905123Z 2025-05-07T20:26:35.7905128Z 2025-05-07T20:26:35.7905133Z 2025-05-07T20:26:35.7905138Z 2025-05-07T20:26:35.7905143Z 2025-05-07T20:26:35.7905506Z  2025-05-07T20:26:35.7905823Z 2025-05-07T20:26:35.7905847Z 2025-05-07T20:26:35.7906000Z  2025-05-07T20:26:35.7906140Z 2025-05-07T20:26:35.7906144Z 2025-05-07T20:26:35.7906663Z  2025-05-07T20:26:35.7906788Z 2025-05-07T20:26:35.7906792Z 2025-05-07T20:26:35.7906802Z 2025-05-07T20:26:35.7906956Z  2025-05-07T20:26:35.7907185Z 2025-05-07T20:26:35.7907189Z 2025-05-07T20:26:35.7907192Z 2025-05-07T20:26:35.7907199Z 2025-05-07T20:26:35.7907488Z  2025-05-07T20:26:35.7907668Z 2025-05-07T20:26:35.7907677Z 2025-05-07T20:26:35.7907683Z 2025-05-07T20:26:35.7907688Z 2025-05-07T20:26:35.7907693Z 2025-05-07T20:26:35.7908054Z  2025-05-07T20:26:35.7908231Z 2025-05-07T20:26:35.7908237Z 2025-05-07T20:26:35.7908246Z 2025-05-07T20:26:35.7908251Z 2025-05-07T20:26:35.7908256Z 2025-05-07T20:26:35.7908262Z 2025-05-07T20:26:35.7908663Z  2025-05-07T20:26:35.7908859Z 2025-05-07T20:26:35.7908864Z 2025-05-07T20:26:35.7908870Z 2025-05-07T20:26:35.7908875Z 2025-05-07T20:26:35.7908880Z 2025-05-07T20:26:35.7908886Z 2025-05-07T20:26:35.7908902Z 2025-05-07T20:26:35.7909075Z  2025-05-07T20:26:35.7909290Z 2025-05-07T20:26:35.7909296Z 2025-05-07T20:26:35.7909301Z 2025-05-07T20:26:35.7909306Z 2025-05-07T20:26:35.7909315Z 2025-05-07T20:26:35.7909320Z 2025-05-07T20:26:35.7909334Z 2025-05-07T20:26:35.7909339Z 2025-05-07T20:26:35.7909655Z  2025-05-07T20:26:35.7909874Z 2025-05-07T20:26:35.7909879Z 2025-05-07T20:26:35.7909885Z 2025-05-07T20:26:35.7909890Z 2025-05-07T20:26:35.7909895Z 2025-05-07T20:26:35.7909901Z 2025-05-07T20:26:35.7909906Z 2025-05-07T20:26:35.7909911Z 2025-05-07T20:26:35.7909932Z 2025-05-07T20:26:35.7910104Z  2025-05-07T20:26:35.7910327Z 2025-05-07T20:26:35.7910333Z 2025-05-07T20:26:35.7910338Z 2025-05-07T20:26:35.7910343Z 2025-05-07T20:26:35.7910348Z 2025-05-07T20:26:35.7910353Z 2025-05-07T20:26:35.7910359Z 2025-05-07T20:26:35.7910371Z 2025-05-07T20:26:35.7910384Z 2025-05-07T20:26:35.7910390Z 2025-05-07T20:26:35.7910574Z  2025-05-07T20:26:35.7910809Z 2025-05-07T20:26:35.7910822Z 2025-05-07T20:26:35.7910828Z 
2025-05-07T20:26:35.7910840Z 2025-05-07T20:26:35.7910845Z 2025-05-07T20:26:35.7910851Z 2025-05-07T20:26:35.7910856Z 2025-05-07T20:26:35.7910861Z 2025-05-07T20:26:35.7910873Z 2025-05-07T20:26:35.7910877Z 2025-05-07T20:26:35.7910888Z 2025-05-07T20:26:35.7911072Z  2025-05-07T20:26:35.7911330Z 2025-05-07T20:26:35.7911335Z 2025-05-07T20:26:35.7911340Z 2025-05-07T20:26:35.7911345Z 2025-05-07T20:26:35.7911350Z 2025-05-07T20:26:35.7911355Z 2025-05-07T20:26:35.7911360Z 2025-05-07T20:26:35.7911364Z 2025-05-07T20:26:35.7911369Z 2025-05-07T20:26:35.7911374Z 2025-05-07T20:26:35.7911386Z 2025-05-07T20:26:35.7911391Z 2025-05-07T20:26:35.7911575Z  2025-05-07T20:26:35.7911841Z 2025-05-07T20:26:35.7911846Z 2025-05-07T20:26:35.7911851Z 2025-05-07T20:26:35.7911855Z 2025-05-07T20:26:35.7911860Z 2025-05-07T20:26:35.7911865Z 2025-05-07T20:26:35.7911870Z 2025-05-07T20:26:35.7911876Z 2025-05-07T20:26:35.7911887Z 2025-05-07T20:26:35.7911892Z 2025-05-07T20:26:35.7911897Z 2025-05-07T20:26:35.7911903Z 2025-05-07T20:26:35.7911908Z 2025-05-07T20:26:35.7912110Z  2025-05-07T20:26:35.7912389Z 2025-05-07T20:26:35.7912395Z 2025-05-07T20:26:35.7912400Z 2025-05-07T20:26:35.7912405Z 2025-05-07T20:26:35.7912410Z 2025-05-07T20:26:35.7912416Z 2025-05-07T20:26:35.7912421Z 2025-05-07T20:26:35.7912426Z 2025-05-07T20:26:35.7912440Z 2025-05-07T20:26:35.7912446Z 2025-05-07T20:26:35.7912450Z 2025-05-07T20:26:35.7912455Z 2025-05-07T20:26:35.7912460Z 2025-05-07T20:26:35.7912473Z 2025-05-07T20:26:35.7912676Z  2025-05-07T20:26:35.7912964Z 2025-05-07T20:26:35.7912970Z 2025-05-07T20:26:35.7912974Z 2025-05-07T20:26:35.7912979Z 2025-05-07T20:26:35.7912984Z 2025-05-07T20:26:35.7912989Z 2025-05-07T20:26:35.7912994Z 2025-05-07T20:26:35.7912999Z 2025-05-07T20:26:35.7913004Z 2025-05-07T20:26:35.7913009Z 2025-05-07T20:26:35.7913014Z 2025-05-07T20:26:35.7913137Z 2025-05-07T20:26:35.7913143Z 2025-05-07T20:26:35.7913148Z 2025-05-07T20:26:35.7913153Z 2025-05-07T20:26:35.7913686Z  2025-05-07T20:26:35.7914168Z 2025-05-07T20:26:35.7914174Z 2025-05-07T20:26:35.7914179Z 2025-05-07T20:26:35.7914184Z 2025-05-07T20:26:35.7914190Z 2025-05-07T20:26:35.7914195Z 2025-05-07T20:26:35.7914200Z 2025-05-07T20:26:35.7914205Z 2025-05-07T20:26:35.7914210Z 2025-05-07T20:26:35.7914216Z 2025-05-07T20:26:35.7914221Z 2025-05-07T20:26:35.7914226Z 2025-05-07T20:26:35.7914232Z 2025-05-07T20:26:35.7914249Z 2025-05-07T20:26:35.7914254Z 2025-05-07T20:26:35.7914260Z 2025-05-07T20:26:35.7914506Z  2025-05-07T20:26:35.7914813Z 2025-05-07T20:26:35.7914818Z 2025-05-07T20:26:35.7914823Z 2025-05-07T20:26:35.7914829Z 2025-05-07T20:26:35.7914834Z 2025-05-07T20:26:35.7914839Z 2025-05-07T20:26:35.7914844Z 2025-05-07T20:26:35.7914849Z 2025-05-07T20:26:35.7914855Z 2025-05-07T20:26:35.7914869Z 2025-05-07T20:26:35.7914874Z 2025-05-07T20:26:35.7914878Z 2025-05-07T20:26:35.7914884Z 2025-05-07T20:26:35.7914889Z 2025-05-07T20:26:35.7914894Z 2025-05-07T20:26:35.7914906Z 2025-05-07T20:26:35.7914911Z 2025-05-07T20:26:35.7915136Z  2025-05-07T20:26:35.7915433Z 2025-05-07T20:26:35.7915438Z 2025-05-07T20:26:35.7915443Z 2025-05-07T20:26:35.7915449Z 2025-05-07T20:26:35.7915454Z 2025-05-07T20:26:35.7915459Z 2025-05-07T20:26:35.7915464Z 2025-05-07T20:26:35.7915469Z 2025-05-07T20:26:35.7915474Z 2025-05-07T20:26:35.7915480Z 2025-05-07T20:26:35.7915485Z 2025-05-07T20:26:35.7915490Z 2025-05-07T20:26:35.7915495Z 2025-05-07T20:26:35.7915500Z 2025-05-07T20:26:35.7915512Z 2025-05-07T20:26:35.7915517Z 2025-05-07T20:26:35.7915523Z 2025-05-07T20:26:35.7915528Z 2025-05-07T20:26:35.7915761Z  2025-05-07T20:26:35.7916063Z 
2025-05-07T20:26:35.7916068Z 2025-05-07T20:26:35.7916232Z  2025-05-07T20:26:35.7916385Z 2025-05-07T20:26:35.7916391Z 2025-05-07T20:26:35.7916549Z  2025-05-07T20:26:35.7916704Z 2025-05-07T20:26:35.7916709Z 2025-05-07T20:26:35.7916722Z 2025-05-07T20:26:35.7916871Z  2025-05-07T20:26:35.7917031Z 2025-05-07T20:26:35.7917037Z 2025-05-07T20:26:35.7917042Z 2025-05-07T20:26:35.7917047Z 2025-05-07T20:26:35.7917204Z  2025-05-07T20:26:35.7917380Z 2025-05-07T20:26:35.7917385Z 2025-05-07T20:26:35.7917391Z 2025-05-07T20:26:35.7917396Z 2025-05-07T20:26:35.7917401Z 2025-05-07T20:26:35.7917560Z  2025-05-07T20:26:35.7917743Z 2025-05-07T20:26:35.7917748Z 2025-05-07T20:26:35.7917753Z 2025-05-07T20:26:35.7917758Z 2025-05-07T20:26:35.7917763Z 2025-05-07T20:26:35.7917769Z 2025-05-07T20:26:35.7917930Z  2025-05-07T20:26:35.7918118Z 2025-05-07T20:26:35.7918123Z 2025-05-07T20:26:35.7918128Z 2025-05-07T20:26:35.7918133Z 2025-05-07T20:26:35.7918138Z 2025-05-07T20:26:35.7918143Z 2025-05-07T20:26:35.7918154Z 2025-05-07T20:26:35.7918321Z  2025-05-07T20:26:35.7918528Z 2025-05-07T20:26:35.7918533Z 2025-05-07T20:26:35.7918538Z 2025-05-07T20:26:35.7918544Z 2025-05-07T20:26:35.7918555Z 2025-05-07T20:26:35.7918560Z 2025-05-07T20:26:35.7918565Z 2025-05-07T20:26:35.7918570Z 2025-05-07T20:26:35.7918747Z  2025-05-07T20:26:35.7918969Z 2025-05-07T20:26:35.7918975Z 2025-05-07T20:26:35.7918980Z 2025-05-07T20:26:35.7918985Z 2025-05-07T20:26:35.7918990Z 2025-05-07T20:26:35.7918996Z 2025-05-07T20:26:35.7919001Z 2025-05-07T20:26:35.7919014Z 2025-05-07T20:26:35.7919019Z 2025-05-07T20:26:35.7919193Z  2025-05-07T20:26:35.7919425Z 2025-05-07T20:26:35.7919430Z 2025-05-07T20:26:35.7919436Z 2025-05-07T20:26:35.7919441Z 2025-05-07T20:26:35.7919446Z 2025-05-07T20:26:35.7919451Z 2025-05-07T20:26:35.7919456Z 2025-05-07T20:26:35.7919461Z 2025-05-07T20:26:35.7919466Z 2025-05-07T20:26:35.7919470Z 2025-05-07T20:26:35.7919825Z  2025-05-07T20:26:35.7920069Z 2025-05-07T20:26:35.7920074Z 2025-05-07T20:26:35.7920079Z 2025-05-07T20:26:35.7920084Z 2025-05-07T20:26:35.7920089Z 2025-05-07T20:26:35.7920331Z 2025-05-07T20:26:35.7920336Z 2025-05-07T20:26:35.7920341Z 2025-05-07T20:26:35.7920346Z 2025-05-07T20:26:35.7920352Z 2025-05-07T20:26:35.7920357Z 2025-05-07T20:26:35.7920588Z  2025-05-07T20:26:35.7920840Z 2025-05-07T20:26:35.7920845Z 2025-05-07T20:26:35.7920850Z 2025-05-07T20:26:35.7920855Z 2025-05-07T20:26:35.7920860Z 2025-05-07T20:26:35.7920865Z 2025-05-07T20:26:35.7920870Z 2025-05-07T20:26:35.7920883Z 2025-05-07T20:26:35.7920889Z 2025-05-07T20:26:35.7920894Z 2025-05-07T20:26:35.7920899Z 2025-05-07T20:26:35.7920904Z 2025-05-07T20:26:35.7921093Z  2025-05-07T20:26:35.7921376Z 2025-05-07T20:26:35.7921383Z 2025-05-07T20:26:35.7921389Z 2025-05-07T20:26:35.7921404Z 2025-05-07T20:26:35.7921409Z 2025-05-07T20:26:35.7921414Z 2025-05-07T20:26:35.7921428Z 2025-05-07T20:26:35.7921433Z 2025-05-07T20:26:35.7921438Z 2025-05-07T20:26:35.7921443Z 2025-05-07T20:26:35.7921448Z 2025-05-07T20:26:35.7921454Z 2025-05-07T20:26:35.7921467Z 2025-05-07T20:26:35.7921697Z  2025-05-07T20:26:35.7921974Z 2025-05-07T20:26:35.7921979Z 2025-05-07T20:26:35.7921984Z 2025-05-07T20:26:35.7921989Z 2025-05-07T20:26:35.7921993Z 2025-05-07T20:26:35.7921998Z 2025-05-07T20:26:35.7922003Z 2025-05-07T20:26:35.7922008Z 2025-05-07T20:26:35.7922013Z 2025-05-07T20:26:35.7922019Z 2025-05-07T20:26:35.7922023Z 2025-05-07T20:26:35.7922029Z 2025-05-07T20:26:35.7922034Z 2025-05-07T20:26:35.7922039Z 2025-05-07T20:26:35.7922245Z  2025-05-07T20:26:35.7922522Z 2025-05-07T20:26:35.7922527Z 2025-05-07T20:26:35.7922532Z 
2025-05-07T20:26:35.7922537Z 2025-05-07T20:26:35.7922543Z 2025-05-07T20:26:35.7922548Z 2025-05-07T20:26:35.7922553Z 2025-05-07T20:26:35.7922558Z 2025-05-07T20:26:35.7922568Z 2025-05-07T20:26:35.7922573Z 2025-05-07T20:26:35.7922578Z 2025-05-07T20:26:35.7922583Z 2025-05-07T20:26:35.7922597Z 2025-05-07T20:26:35.7922603Z 2025-05-07T20:26:35.7922608Z 2025-05-07T20:26:35.7922831Z  2025-05-07T20:26:35.7923118Z 2025-05-07T20:26:35.7923124Z 2025-05-07T20:26:35.7923129Z 2025-05-07T20:26:35.7923134Z 2025-05-07T20:26:35.7923139Z 2025-05-07T20:26:35.7923144Z 2025-05-07T20:26:35.7923149Z 2025-05-07T20:26:35.7923154Z 2025-05-07T20:26:35.7923159Z 2025-05-07T20:26:35.7923164Z 2025-05-07T20:26:35.7923169Z 2025-05-07T20:26:35.7923174Z 2025-05-07T20:26:35.7923179Z 2025-05-07T20:26:35.7923184Z 2025-05-07T20:26:35.7923189Z 2025-05-07T20:26:35.7923195Z 2025-05-07T20:26:35.7923414Z  2025-05-07T20:26:35.7923703Z 2025-05-07T20:26:35.7923708Z 2025-05-07T20:26:35.7923713Z 2025-05-07T20:26:35.7923718Z 2025-05-07T20:26:35.7923723Z 2025-05-07T20:26:35.7923729Z 2025-05-07T20:26:35.7923739Z 2025-05-07T20:26:35.7923744Z 2025-05-07T20:26:35.7923749Z 2025-05-07T20:26:35.7923754Z 2025-05-07T20:26:35.7923759Z 2025-05-07T20:26:35.7923764Z 2025-05-07T20:26:35.7923769Z 2025-05-07T20:26:35.7923779Z 2025-05-07T20:26:35.7923784Z 2025-05-07T20:26:35.7923796Z 2025-05-07T20:26:35.7923801Z 2025-05-07T20:26:35.7924016Z  2025-05-07T20:26:35.7924315Z 2025-05-07T20:26:35.7924321Z 2025-05-07T20:26:35.7924326Z 2025-05-07T20:26:35.7924331Z 2025-05-07T20:26:35.7924336Z 2025-05-07T20:26:35.7924349Z 2025-05-07T20:26:35.7924354Z 2025-05-07T20:26:35.7924359Z 2025-05-07T20:26:35.7924365Z 2025-05-07T20:26:35.7924370Z 2025-05-07T20:26:35.7924375Z 2025-05-07T20:26:35.7924380Z 2025-05-07T20:26:35.7924385Z 2025-05-07T20:26:35.7924390Z 2025-05-07T20:26:35.7924395Z 2025-05-07T20:26:35.7924400Z 2025-05-07T20:26:35.7924406Z 2025-05-07T20:26:35.7924411Z 2025-05-07T20:26:35.7924648Z  2025-05-07T20:26:35.7925058Z 2025-05-07T20:26:35.7925065Z 2025-05-07T20:26:35.7925211Z  2025-05-07T20:26:35.7925362Z 2025-05-07T20:26:35.7925367Z 2025-05-07T20:26:35.7925507Z  2025-05-07T20:26:35.7925742Z 2025-05-07T20:26:35.7925747Z 2025-05-07T20:26:35.7925753Z 2025-05-07T20:26:35.7925904Z  2025-05-07T20:26:35.7926060Z 2025-05-07T20:26:35.7926065Z 2025-05-07T20:26:35.7926069Z 2025-05-07T20:26:35.7926074Z 2025-05-07T20:26:35.7926239Z  2025-05-07T20:26:35.7926401Z 2025-05-07T20:26:35.7926407Z 2025-05-07T20:26:35.7926412Z 2025-05-07T20:26:35.7926417Z 2025-05-07T20:26:35.7926423Z 2025-05-07T20:26:35.7926585Z  2025-05-07T20:26:35.7926762Z 2025-05-07T20:26:35.7926767Z 2025-05-07T20:26:35.7926772Z 2025-05-07T20:26:35.7926777Z 2025-05-07T20:26:35.7926782Z 2025-05-07T20:26:35.7926787Z 2025-05-07T20:26:35.7926952Z  2025-05-07T20:26:35.7927130Z 2025-05-07T20:26:35.7927135Z 2025-05-07T20:26:35.7927140Z 2025-05-07T20:26:35.7927145Z 2025-05-07T20:26:35.7927157Z 2025-05-07T20:26:35.7927163Z 2025-05-07T20:26:35.7927168Z 2025-05-07T20:26:35.7927336Z  2025-05-07T20:26:35.7927530Z 2025-05-07T20:26:35.7927535Z 2025-05-07T20:26:35.7927547Z 2025-05-07T20:26:35.7927552Z 2025-05-07T20:26:35.7927557Z 2025-05-07T20:26:35.7927563Z 2025-05-07T20:26:35.7927568Z 2025-05-07T20:26:35.7927573Z 2025-05-07T20:26:35.7927746Z  2025-05-07T20:26:35.7927955Z 2025-05-07T20:26:35.7927960Z 2025-05-07T20:26:35.7927965Z 2025-05-07T20:26:35.7927971Z 2025-05-07T20:26:35.7927975Z 2025-05-07T20:26:35.7927981Z 2025-05-07T20:26:35.7927986Z 2025-05-07T20:26:35.7927991Z 2025-05-07T20:26:35.7927996Z 2025-05-07T20:26:35.7928190Z  
2025-05-07T20:26:35.7928415Z 2025-05-07T20:26:35.7928420Z 2025-05-07T20:26:35.7928425Z 2025-05-07T20:26:35.7928430Z 2025-05-07T20:26:35.7928436Z 2025-05-07T20:26:35.7928441Z 2025-05-07T20:26:35.7928446Z 2025-05-07T20:26:35.7928451Z 2025-05-07T20:26:35.7928457Z 2025-05-07T20:26:35.7928477Z 2025-05-07T20:26:35.7928659Z  2025-05-07T20:26:35.7928892Z 2025-05-07T20:26:35.7928898Z 2025-05-07T20:26:35.7928903Z 2025-05-07T20:26:35.7928913Z 2025-05-07T20:26:35.7928918Z 2025-05-07T20:26:35.7928924Z 2025-05-07T20:26:35.7928936Z 2025-05-07T20:26:35.7928942Z 2025-05-07T20:26:35.7928947Z 2025-05-07T20:26:35.7928952Z 2025-05-07T20:26:35.7928957Z 2025-05-07T20:26:35.7929139Z  2025-05-07T20:26:35.7929385Z 2025-05-07T20:26:35.7929390Z 2025-05-07T20:26:35.7929396Z 2025-05-07T20:26:35.7929409Z 2025-05-07T20:26:35.7929414Z 2025-05-07T20:26:35.7929419Z 2025-05-07T20:26:35.7929424Z 2025-05-07T20:26:35.7929429Z 2025-05-07T20:26:35.7929434Z 2025-05-07T20:26:35.7929440Z 2025-05-07T20:26:35.7929445Z 2025-05-07T20:26:35.7929450Z 2025-05-07T20:26:35.7929633Z  2025-05-07T20:26:35.7929900Z 2025-05-07T20:26:35.7929905Z 2025-05-07T20:26:35.7929911Z 2025-05-07T20:26:35.7929915Z 2025-05-07T20:26:35.7929926Z 2025-05-07T20:26:35.7929930Z 2025-05-07T20:26:35.7929936Z 2025-05-07T20:26:35.7929941Z 2025-05-07T20:26:35.7929946Z 2025-05-07T20:26:35.7929951Z 2025-05-07T20:26:35.7929963Z 2025-05-07T20:26:35.7929968Z 2025-05-07T20:26:35.7929973Z 2025-05-07T20:26:35.7930162Z  2025-05-07T20:26:35.7930438Z 2025-05-07T20:26:35.7930444Z 2025-05-07T20:26:35.7930449Z 2025-05-07T20:26:35.7930454Z 2025-05-07T20:26:35.7930459Z 2025-05-07T20:26:35.7930464Z 2025-05-07T20:26:35.7930469Z 2025-05-07T20:26:35.7930474Z 2025-05-07T20:26:35.7930480Z 2025-05-07T20:26:35.7930485Z 2025-05-07T20:26:35.7930490Z 2025-05-07T20:26:35.7930496Z 2025-05-07T20:26:35.7930529Z 2025-05-07T20:26:35.7930534Z 2025-05-07T20:26:35.7930732Z  2025-05-07T20:26:35.7931013Z 2025-05-07T20:26:35.7931018Z 2025-05-07T20:26:35.7931023Z 2025-05-07T20:26:35.7931035Z 2025-05-07T20:26:35.7931041Z 2025-05-07T20:26:35.7931045Z 2025-05-07T20:26:35.7931156Z 2025-05-07T20:26:35.7931164Z 2025-05-07T20:26:35.7931169Z 2025-05-07T20:26:35.7931174Z 2025-05-07T20:26:35.7931180Z 2025-05-07T20:26:35.7931185Z 2025-05-07T20:26:35.7931273Z 2025-05-07T20:26:35.7931278Z 2025-05-07T20:26:35.7931283Z 2025-05-07T20:26:35.7931506Z  2025-05-07T20:26:35.7931797Z 2025-05-07T20:26:35.7931802Z 2025-05-07T20:26:35.7931808Z 2025-05-07T20:26:35.7931813Z 2025-05-07T20:26:35.7931818Z 2025-05-07T20:26:35.7931823Z 2025-05-07T20:26:35.7931828Z 2025-05-07T20:26:35.7931833Z 2025-05-07T20:26:35.7931838Z 2025-05-07T20:26:35.7931843Z 2025-05-07T20:26:35.7931849Z 2025-05-07T20:26:35.7931854Z 2025-05-07T20:26:35.7931859Z 2025-05-07T20:26:35.7931864Z 2025-05-07T20:26:35.7931869Z 2025-05-07T20:26:35.7931874Z 2025-05-07T20:26:35.7932089Z  2025-05-07T20:26:35.7932381Z 2025-05-07T20:26:35.7932386Z 2025-05-07T20:26:35.7932391Z 2025-05-07T20:26:35.7932396Z 2025-05-07T20:26:35.7932409Z 2025-05-07T20:26:35.7932414Z 2025-05-07T20:26:35.7932419Z 2025-05-07T20:26:35.7932424Z 2025-05-07T20:26:35.7932429Z 2025-05-07T20:26:35.7932435Z 2025-05-07T20:26:35.7932454Z 2025-05-07T20:26:35.7932459Z 2025-05-07T20:26:35.7932465Z 2025-05-07T20:26:35.7932470Z 2025-05-07T20:26:35.7932475Z 2025-05-07T20:26:35.7932480Z 2025-05-07T20:26:35.7932485Z 2025-05-07T20:26:35.7932702Z  2025-05-07T20:26:35.7933008Z 2025-05-07T20:26:35.7933013Z 2025-05-07T20:26:35.7933018Z 2025-05-07T20:26:35.7933023Z 2025-05-07T20:26:35.7933028Z 2025-05-07T20:26:35.7933033Z 
2025-05-07T20:26:35.7933039Z 2025-05-07T20:26:35.7933044Z 2025-05-07T20:26:35.7933049Z 2025-05-07T20:26:35.7933054Z 2025-05-07T20:26:35.7933059Z 2025-05-07T20:26:35.7933064Z 2025-05-07T20:26:35.7933069Z 2025-05-07T20:26:35.7933074Z 2025-05-07T20:26:35.7933079Z 2025-05-07T20:26:35.7933084Z 2025-05-07T20:26:35.7933090Z 2025-05-07T20:26:35.7933094Z 2025-05-07T20:26:35.7933333Z  2025-05-07T20:26:35.7933631Z 2025-05-07T20:26:35.7933637Z 2025-05-07T20:26:35.7933774Z  2025-05-07T20:26:35.7933928Z 2025-05-07T20:26:35.7933940Z 2025-05-07T20:26:35.7934068Z  2025-05-07T20:26:35.7934178Z 2025-05-07T20:26:35.7934188Z 2025-05-07T20:26:35.7934192Z 2025-05-07T20:26:35.7934296Z  2025-05-07T20:26:35.7934405Z 2025-05-07T20:26:35.7934409Z 2025-05-07T20:26:35.7934412Z 2025-05-07T20:26:35.7934416Z 2025-05-07T20:26:35.7934546Z  2025-05-07T20:26:35.7934664Z 2025-05-07T20:26:35.7934668Z 2025-05-07T20:26:35.7934672Z 2025-05-07T20:26:35.7934680Z 2025-05-07T20:26:35.7934684Z 2025-05-07T20:26:35.7934794Z  2025-05-07T20:26:35.7934917Z 2025-05-07T20:26:35.7934921Z 2025-05-07T20:26:35.7934924Z 2025-05-07T20:26:35.7934928Z 2025-05-07T20:26:35.7934932Z 2025-05-07T20:26:35.7934941Z 2025-05-07T20:26:35.7935052Z  2025-05-07T20:26:35.7935191Z 2025-05-07T20:26:35.7935195Z 2025-05-07T20:26:35.7935203Z 2025-05-07T20:26:35.7935206Z 2025-05-07T20:26:35.7935210Z 2025-05-07T20:26:35.7935214Z 2025-05-07T20:26:35.7935217Z 2025-05-07T20:26:35.7935338Z  2025-05-07T20:26:35.7935483Z 2025-05-07T20:26:35.7935487Z 2025-05-07T20:26:35.7935490Z 2025-05-07T20:26:35.7935494Z 2025-05-07T20:26:35.7935498Z 2025-05-07T20:26:35.7935501Z 2025-05-07T20:26:35.7935505Z 2025-05-07T20:26:35.7935509Z 2025-05-07T20:26:35.7935634Z  2025-05-07T20:26:35.7935785Z 2025-05-07T20:26:35.7935788Z 2025-05-07T20:26:35.7935792Z 2025-05-07T20:26:35.7935795Z 2025-05-07T20:26:35.7935799Z 2025-05-07T20:26:35.7935803Z 2025-05-07T20:26:35.7935806Z 2025-05-07T20:26:35.7935816Z 2025-05-07T20:26:35.7935819Z 2025-05-07T20:26:35.7935945Z  2025-05-07T20:26:35.7936101Z 2025-05-07T20:26:35.7936105Z 2025-05-07T20:26:35.7936108Z 2025-05-07T20:26:35.7936112Z 2025-05-07T20:26:35.7936116Z 2025-05-07T20:26:35.7936119Z 2025-05-07T20:26:35.7936128Z 2025-05-07T20:26:35.7936230Z 2025-05-07T20:26:35.7936235Z 2025-05-07T20:26:35.7936239Z 2025-05-07T20:26:35.7936369Z  2025-05-07T20:26:35.7936533Z 2025-05-07T20:26:35.7936613Z 2025-05-07T20:26:35.7936616Z 2025-05-07T20:26:35.7936626Z 2025-05-07T20:26:35.7936630Z 2025-05-07T20:26:35.7936633Z 2025-05-07T20:26:35.7936637Z 2025-05-07T20:26:35.7936641Z 2025-05-07T20:26:35.7936644Z 2025-05-07T20:26:35.7936648Z 2025-05-07T20:26:35.7936651Z 2025-05-07T20:26:35.7936803Z  2025-05-07T20:26:35.7937075Z 2025-05-07T20:26:35.7937080Z 2025-05-07T20:26:35.7937085Z 2025-05-07T20:26:35.7937091Z 2025-05-07T20:26:35.7937096Z 2025-05-07T20:26:35.7937101Z 2025-05-07T20:26:35.7937107Z 2025-05-07T20:26:35.7937112Z 2025-05-07T20:26:35.7937117Z 2025-05-07T20:26:35.7937122Z 2025-05-07T20:26:35.7937128Z 2025-05-07T20:26:35.7937133Z 2025-05-07T20:26:35.7937336Z  2025-05-07T20:26:35.7937606Z 2025-05-07T20:26:35.7937619Z 2025-05-07T20:26:35.7937624Z 2025-05-07T20:26:35.7937629Z 2025-05-07T20:26:35.7937635Z 2025-05-07T20:26:35.7937639Z 2025-05-07T20:26:35.7937645Z 2025-05-07T20:26:35.7937650Z 2025-05-07T20:26:35.7937664Z 2025-05-07T20:26:35.7937669Z 2025-05-07T20:26:35.7937674Z 2025-05-07T20:26:35.7937679Z 2025-05-07T20:26:35.7937684Z 2025-05-07T20:26:35.7937904Z  done 2025-05-07T20:26:36.1011745Z Preparing transaction: \ | / done 2025-05-07T20:26:42.6769472Z Verifying 
transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-05-07T20:26:43.5950038Z Executing transaction: / - \ | / - \ | / done 2025-05-07T20:26:46.1935412Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:46.1936018Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:46.1936751Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:46.1937344Z 2025-05-07T20:26:46.1949341Z 2025-05-07T20:26:46.1950550Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:46.1951533Z 2025-05-07T20:26:46.1963712Z 2025-05-07T20:26:46.1964160Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:46.1969172Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:46.1975969Z 2025-05-07T20:26:46.3710050Z 2025-05-07T20:26:46.3716293Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:46.3720553Z 2025-05-07T20:26:46.3738427Z 2025-05-07T20:26:46.3738900Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:46.4113972Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:48.2872777Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
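The [INSTALL] fix-up block above works around a CUDA 12.x packaging change: NVTX v3 is header-only and the legacy libnvToolsExt.so library no longer ships, while older build scripts still link the unversioned name. A minimal sketch of the same workaround, assuming an activated conda env (so CONDA_PREFIX is set) laid out like the one in this log; the nsight-compute-* glob is an assumption standing in for the exact versioned directory:

# Recreate the unversioned NVTX link name from the versioned library that the
# conda CUDA 12.x packages still provide (paths mirror the log above).
for libdir in "${CONDA_PREFIX}/lib" "${CONDA_PREFIX}/targets/x86_64-linux/lib"; do
  if [ -f "${libdir}/libnvToolsExt.so.1" ]; then
    ln -sf "${libdir}/libnvToolsExt.so.1" "${libdir}/libnvToolsExt.so"
  fi
done
# Make the nvtx3 headers visible on the default include path; the conda
# packages currently place them under the nsight-compute tree.
cp -r "${CONDA_PREFIX}"/nsight-compute-*/host/target-linux-x64/nvtx/include/nvtx3/* \
      "${CONDA_PREFIX}/include/"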
2025-05-07T20:26:48.3499136Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:48.7731228Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:48.8085056Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:49.2377692Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:49.2378643Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:51.6828623Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:53.7172346Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:55.7367311Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:55.7368387Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:26:57.7594543Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:59.6557892Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:26:59.7173281Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:03.5709875Z /tmp/tmpmahzl8qv: line 3: clang: command not found
2025-05-07T20:27:03.5710928Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:03.6338520Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:03.6358470Z total 36
2025-05-07T20:27:03.6359117Z drwxr-xr-x. 2 ec2-user ec2-user   191 May  7 20:26 .
2025-05-07T20:27:03.6359721Z drwxr-xr-x. 5 ec2-user ec2-user    62 May  7 20:25 ..
2025-05-07T20:27:03.6360294Z -rw-r--r--. 2 ec2-user ec2-user  3778 Jun 10  2024 activate-binutils_linux-64.sh
2025-05-07T20:27:03.6361002Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10  2024 activate-gcc_linux-64.sh
2025-05-07T20:27:03.6361717Z -rw-r--r--. 2 ec2-user ec2-user  5190 Jun 10  2024 activate-gxx_linux-64.sh
2025-05-07T20:27:03.6362392Z -rw-r--r--. 2 ec2-user ec2-user   136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:03.6362913Z -rw-r--r--. 2 ec2-user ec2-user   872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:03.6363392Z -rw-r--r--. 2 ec2-user ec2-user  2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:27:03.6364257Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
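Note that the earlier `conda run printenv LD_LIBRARY_PATH` failed simply because the variable had not been set yet; `conda env config vars set` then stores it in the env metadata so every subsequent activation (including `conda run`) exports it. A short sketch of that set-then-verify pattern, assuming the env name used in this log (build_binary); the probe commands are illustrative stand-ins for the [CHECK] lines above:

# Persist a variable in the env itself, then confirm a fresh activation sees it.
conda env config vars set -n build_binary \
    NVML_LIB_PATH="/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so"
conda run -n build_binary printenv NVML_LIB_PATH
# Sanity-probe the toolchain the same way the [CHECK] lines do; the stubs
# directory supplies link-time libcuda.so/libnvidia-ml.so without a driver.
conda run -n build_binary bash -c '
  test -f "${CONDA_PREFIX}/targets/x86_64-linux/include/cuda_runtime.h" && echo "cuda_runtime.h ok"
  test -f "${CONDA_PREFIX}/targets/x86_64-linux/lib/stubs/libcuda.so"   && echo "libcuda.so stub ok"
  command -v nvcc
'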
2025-05-07T20:27:03.6364952Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
2025-05-07T20:27:03.6384043Z + conda run -n build_binary c++ --version | grep -i clang
2025-05-07T20:27:05.5992984Z [BUILD] Setting prepend flags for NVCC ...
2025-05-07T20:27:05.5993556Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler"
2025-05-07T20:27:06.0411177Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS
2025-05-07T20:27:07.9413099Z -allow-unsupported-compiler
2025-05-07T20:27:08.0055713Z [INFO] Printing out all preprocessor defines in nvcc ...
2025-05-07T20:27:08.0056275Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null
2025-05-07T20:27:09.9635111Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead")))
2025-05-07T20:27:09.9635776Z #define M_PIl 3.141592653589793238462643383279502884L
2025-05-07T20:27:09.9636135Z #define _IO_CURRENTLY_PUTTING 0x800
2025-05-07T20:27:09.9636469Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig))
2025-05-07T20:27:09.9636818Z #define __DBL_MIN_EXP__ (-1021)
2025-05-07T20:27:09.9637100Z #define _STL_PAIR_H 1
2025-05-07T20:27:09.9637360Z #define __cpp_attributes 200809L
2025-05-07T20:27:09.9637734Z #define __cpp_nontype_template_parameter_auto 201606L
2025-05-07T20:27:09.9638240Z #define __DELETE_THROW throw()
2025-05-07T20:27:09.9638609Z #define _PTRDIFF_T_
2025-05-07T20:27:09.9638957Z #define M_PI_4 0.78539816339744830962
2025-05-07T20:27:09.9639339Z #define __UINT_LEAST16_MAX__ 0xffff
2025-05-07T20:27:09.9639624Z #define _IO_LEFT 02
2025-05-07T20:27:09.9639876Z #define __ATOMIC_ACQUIRE 2
2025-05-07T20:27:09.9640253Z #define _POSIX2_BC_SCALE_MAX 99
2025-05-07T20:27:09.9640543Z #define _GLIBCXX_USE_RANDOM_TR1 1
2025-05-07T20:27:09.9640996Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp)
2025-05-07T20:27:09.9641487Z #define __FLT128_MAX_10_EXP__ 4932
2025-05-07T20:27:09.9641906Z #define RE_DUP_MAX (0x7fff)
2025-05-07T20:27:09.9642273Z #define _IOS_OUTPUT 2
2025-05-07T20:27:09.9642622Z #define __SM_100_RT_HPP__
2025-05-07T20:27:09.9643074Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F
2025-05-07T20:27:09.9643602Z #define toascii_l(c,l) __toascii_l ((c), (l))
2025-05-07T20:27:09.9644054Z #define __GCC_IEC_559_COMPLEX 2
2025-05-07T20:27:09.9644433Z #define _GLIBCXX_USE_FCHMOD 1
2025-05-07T20:27:09.9644834Z #define __cpp_aggregate_nsdmi 201304L
2025-05-07T20:27:09.9645958Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; }))
2025-05-07T20:27:09.9647095Z #define __UINT_LEAST8_TYPE__ unsigned char
2025-05-07T20:27:09.9647578Z #define __SIZEOF_FLOAT80__ 16
2025-05-07T20:27:09.9648031Z #define cudaTextureTypeCubemapLayered 0xFC
2025-05-07T20:27:09.9648471Z #define _T_WCHAR_
2025-05-07T20:27:09.9648781Z #define stdout stdout
2025-05-07T20:27:09.9649247Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11")))
2025-05-07T20:27:09.9649796Z #define CHAR_BIT __CHAR_BIT__
2025-05-07T20:27:09.9661112Z #define __flexarr [] 2025-05-07T20:27:09.9661517Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:09.9662005Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:09.9662509Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:09.9662867Z #define _MATH_H 1 2025-05-07T20:27:09.9663262Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:09.9663755Z #define __S64_TYPE long int 2025-05-07T20:27:09.9664111Z #define __stub_fchflags 2025-05-07T20:27:09.9664806Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:09.9665131Z #define __SQUAD_TYPE long int 2025-05-07T20:27:09.9665408Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:09.9665866Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:09.9666224Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:09.9666499Z #define NL_NMAX INT_MAX 2025-05-07T20:27:09.9666735Z #define _BITS_TIME_H 1 2025-05-07T20:27:09.9667022Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:09.9667369Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:09.9667680Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:09.9668044Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:09.9668457Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:09.9668839Z #define __CHAR_BIT__ 8 2025-05-07T20:27:09.9669104Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.9669433Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:09.9669747Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:09.9670018Z #define FP_NAN 0 2025-05-07T20:27:09.9670288Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:09.9670729Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:09.9671125Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:09.9671451Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:09.9671746Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:09.9672002Z #define __SM_80_RT_H__ 2025-05-07T20:27:09.9672236Z #define _NEW 2025-05-07T20:27:09.9672472Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:09.9672753Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:09.9673138Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:09.9673558Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:09.9673808Z #define __USE_ANSI 1 2025-05-07T20:27:09.9674102Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:09.9674521Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:09.9674897Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:09.9675206Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:09.9675501Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:09.9675803Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:09.9676089Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:09.9676386Z #define PIPE_BUF 4096 2025-05-07T20:27:09.9676725Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:09.9677194Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:09.9677582Z #define ADJ_TICK 0x4000 2025-05-07T20:27:09.9677873Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:09.9678210Z #define MQ_PRIO_MAX 32768 2025-05-07T20:27:09.9678480Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:09.9678813Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:09.9679296Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) 
<< 16) + (min)) 2025-05-07T20:27:09.9679841Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:09.9680322Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:09.9680587Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:09.9680873Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:09.9681168Z #define __cpp_static_assert 201411L 2025-05-07T20:27:09.9681472Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:09.9681779Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:09.9682073Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:09.9682361Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:09.9682674Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:09.9682957Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:09.9683266Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.9683637Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:09.9683986Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:09.9684276Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:09.9684698Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:09.9685067Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:09.9685432Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:09.9685814Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:09.9686114Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:09.9686451Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:09.9686786Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:09.9687204Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:09.9687624Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:09.9687936Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:09.9688212Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:09.9688497Z #define __GCC_IEC_559 2 2025-05-07T20:27:09.9688801Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:09.9689149Z #define _IO_flockfile(_fp) 2025-05-07T20:27:09.9689420Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:09.9689700Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:09.9689974Z #define _IOFBF 0 2025-05-07T20:27:09.9690186Z #define __USE_BSD 1 2025-05-07T20:27:09.9690427Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:09.9690704Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:09.9690986Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:09.9691244Z #define _IO_NO_WRITES 8 2025-05-07T20:27:09.9691507Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:09.9691876Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:09.9692234Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:09.9692550Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:09.9692881Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:09.9693176Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:09.9693454Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:09.9693732Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:09.9694046Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:09.9694451Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:09.9694830Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:09.9695151Z #define M_PI 3.14159265358979323846 2025-05-07T20:27:09.9695464Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:09.9695804Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:09.9696128Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:09.9696433Z 
#define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:09.9696719Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:09.9696998Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:09.9697592Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:09.9698196Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:09.9699275Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:09.9699985Z 2025-05-07T20:27:09.9700124Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:09.9700462Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:09.9700772Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:09.9701063Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:09.9701358Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:09.9701700Z #define __ASSERT_VOID_CAST static_cast 2025-05-07T20:27:09.9702038Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:09.9702346Z #define RAND_MAX 2147483647 2025-05-07T20:27:09.9702613Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:09.9702951Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:09.9703278Z #define __SM_90_RT_H__ 2025-05-07T20:27:09.9703524Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:09.9703791Z #define __COMPAR_FN_T 2025-05-07T20:27:09.9704036Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:09.9704388Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:09.9704881Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:09.9705480Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:09.9705828Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:09.9706198Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:09.9706505Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:09.9706854Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:09.9707174Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:09.9707696Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:09.9708254Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:09.9708588Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:09.9708880Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:09.9709193Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:09.9709505Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:09.9709786Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:09.9710064Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:09.9710329Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:09.9710592Z #define __u_char_defined 2025-05-07T20:27:09.9710918Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:09.9711289Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:09.9711548Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:09.9711814Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:09.9712104Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:09.9712553Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:09.9712990Z #define FP_INFINITE 1 2025-05-07T20:27:09.9713702Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:09.9714221Z #define _IO_pid_t __pid_t 2025-05-07T20:27:09.9714487Z #define __UINT_FAST8_MAX__ 0xff 
[... full predefined-macro dump from the nvcc/gcc preprocessor omitted (several thousand #define lines); the build-identifying entries are retained below ...]
2025-05-07T20:27:09.9858529Z #define __CUDACC_VER_MINOR__ 8
2025-05-07T20:27:09.9878409Z #define __GNUC__ 11
2025-05-07T20:27:09.9920065Z #define __cplusplus 201703L
2025-05-07T20:27:09.9993400Z #define __CUDACC_VER_BUILD__ 61
2025-05-07T20:27:10.0009731Z #define __VERSION__ "11.4.0"
2025-05-07T20:27:10.0047166Z #define __CUDACC_VER_MAJOR__ 12
2025-05-07T20:27:10.0047957Z #define __CUDA_ARCH__ 520
2025-05-07T20:27:10.0049137Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0049238Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:10.0049341Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:10.0049440Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:10.0049718Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:10.0049879Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:10.0049980Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:10.0050388Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:10.0050518Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:10.0050618Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:10.0050705Z #define __FLT_RADIX__ 2 2025-05-07T20:27:10.0050807Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:10.0050976Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:10.0051073Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:10.0051167Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:10.0051275Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:10.0051372Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:10.0051470Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:10.0051584Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:10.0051668Z #define WORD_BIT 32 2025-05-07T20:27:10.0051757Z #define _IO_USER_BUF 1 2025-05-07T20:27:10.0051849Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:10.0051957Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0052069Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:10.0052168Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:10.0052269Z #define __long_double_t long double 2025-05-07T20:27:10.0052369Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:10.0052461Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:10.0052868Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:10.0052953Z #define __k8 1 2025-05-07T20:27:10.0053151Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:10.0053322Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:10.0053441Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:10.0053546Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:10.0053649Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:10.0053749Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:10.0053848Z #define __blksize_t_defined 2025-05-07T20:27:10.0053944Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:10.0054043Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:10.0054156Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:10.0054255Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:10.0054361Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:10.0054457Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:10.0054557Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:10.0054816Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:10.0055167Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:10.0055269Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:10.0055371Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:10.0055458Z #define SEEK_SET 0 2025-05-07T20:27:10.0055556Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:10.0055652Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:10.0055858Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:10.0055961Z #define __cudaCDP2GetLastError 2025-05-07T20:27:10.0056056Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:10.0056151Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:10.0056476Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:10.0056583Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:10.0056682Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:10.0056771Z #define __stub_sigreturn 2025-05-07T20:27:10.0057018Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:10.0057115Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:10.0057294Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:10.0057401Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:10.0057487Z #define CLOCK_TAI 11 2025-05-07T20:27:10.0057594Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:10.0057885Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:10.0057974Z #define __restrict_arr 2025-05-07T20:27:10.0058090Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:10.0058233Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:10.0058769Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:10.0058960Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:10.0059051Z #define __USE_MISC 1 2025-05-07T20:27:10.0059154Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:10.0059264Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:10.0059351Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:10.0059441Z #define __LDBL_DIG__ 18 2025-05-07T20:27:10.0059536Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:10.0059644Z #define __malloc_and_calloc_defined 2025-05-07T20:27:10.0059741Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:10.0059845Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:10.0059928Z #define __x86_64__ 1 2025-05-07T20:27:10.0060016Z #define _SIZE_T_ 2025-05-07T20:27:10.0060920Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:10.0061026Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:10.0061125Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:10.0061242Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:10.0061361Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:10.0061456Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:10.0061570Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:10.0061694Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:10.0061836Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:10.0061937Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:10.0062412Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:10.0062538Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:10.0062690Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:10.0062790Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:10.0062884Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:10.0062976Z #define STA_FLL 0x0008 2025-05-07T20:27:10.0063126Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:10.0063223Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:10.0063348Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0063464Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:10.0063550Z #define __stub_revoke 2025-05-07T20:27:10.0063645Z #define __timer_t_defined 1 2025-05-07T20:27:10.0063779Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:10.0063874Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:10.0063981Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:10.0064086Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:10.0064185Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:10.0064287Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:10.0064397Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:10.0064498Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:10.0064644Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:10.0064844Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:10.0064942Z #define _IO_off_t __off_t 2025-05-07T20:27:10.0065033Z #define __FLT64_DIG__ 15 2025-05-07T20:27:10.0065261Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:10.0065434Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:10.0065563Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0065689Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:10.0065784Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:10.0065886Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:10.0065976Z #define NULL __null 2025-05-07T20:27:10.0066108Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:10.0066212Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:10.0066317Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:10.0066412Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0066504Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:10.0066593Z #define FP_ZERO 2 2025-05-07T20:27:10.0066698Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:10.0066856Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:10.0066965Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0067053Z #define __WCHAR_T__ 2025-05-07T20:27:10.0067150Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:10.0067348Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:10.0067502Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:10.0067605Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:10.0067727Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:10.0067843Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:10.0067976Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:10.0068103Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:10.0068200Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:10.0068291Z #define _SIGSET_H_types 1 2025-05-07T20:27:10.0068413Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:10.0068521Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:10.0068671Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:10.0068781Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:10.0068904Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:10.0069035Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:10.0069143Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:10.0069274Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:10.0069389Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:10.0069568Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:10.0069662Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:10.0069765Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:10.0069869Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:10.0069961Z #define STA_MODE 0x4000 2025-05-07T20:27:10.0070075Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:10.0070182Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:10.0070297Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:10.0070398Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:10.0070504Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:10.0070609Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:10.0070703Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:10.0070819Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:10.0070908Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:10.0071033Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:10.0071113Z #define __SEG_FS 1 2025-05-07T20:27:10.0071203Z #define _IO_size_t size_t 2025-05-07T20:27:10.0071306Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:10.0071402Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:10.0071486Z #define __stub_lchmod 2025-05-07T20:27:10.0071640Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:10.0071809Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0072024Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:10.0072164Z #define __SEG_GS 1 2025-05-07T20:27:10.0080096Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:10.0080418Z #define _IOS_APPEND 8 2025-05-07T20:27:10.0080521Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:10.0080618Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:10.0080726Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:10.0080829Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:10.0080943Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:10.0081031Z #define htole16(x) (x) 2025-05-07T20:27:10.0081146Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:10.0081250Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:10.0081360Z #define __INT16_TYPE__ short int 2025-05-07T20:27:10.0081476Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:10.0081612Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:10.0081725Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:10.0081854Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:10.0081947Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:10.0082036Z #define __WCLONE 0x80000000 2025-05-07T20:27:10.0082136Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:10.0082222Z #define SEEK_HOLE 4 2025-05-07T20:27:10.0082309Z #define TIMER_ABSTIME 1 2025-05-07T20:27:10.0082406Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:10.0082496Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:10.0082674Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:10.0082796Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0082894Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:10.0083005Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:10.0083106Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0083229Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:10.0083322Z #define _LINUX_LIMITS_H 2025-05-07T20:27:10.0083404Z #define linux 1 2025-05-07T20:27:10.0083504Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:10.0083618Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:10.0083718Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:10.0083826Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:10.0083932Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:10.0084083Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:10.0084186Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:10.0084282Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0084380Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:10.0084475Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:10.0084561Z #define htole64(x) (x) 2025-05-07T20:27:10.0084661Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:10.0084789Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:10.0084884Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:10.0085389Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:10.0085481Z #define __USE_POSIX2 1 2025-05-07T20:27:10.0085581Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:10.0085671Z #define __WALL 0x40000000 2025-05-07T20:27:10.0085775Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:10.0085859Z #define _XLOCALE_H 1 2025-05-07T20:27:10.0085959Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:10.0086058Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:10.0086154Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0086263Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:10.0086349Z #define __EXCEPTIONS 1 2025-05-07T20:27:10.0086455Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:10.0086653Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:10.0086741Z #define __WORDSIZE 64 2025-05-07T20:27:10.0086835Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:10.0086923Z #define _STL_RELOPS_H 1 2025-05-07T20:27:10.0087021Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:10.0087286Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:10.0087387Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:10.0087479Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:10.0087579Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:10.0087961Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:10.0088198Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:10.0088323Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:10.0088426Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:10.0088530Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:10.0088642Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:10.0088747Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:10.0088855Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:10.0089038Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:10.0089142Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:10.0089239Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:10.0089344Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:10.0089524Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:10.0089645Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:10.0089729Z #define _STRING_H 1 2025-05-07T20:27:10.0089830Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:10.0089918Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:10.0090023Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:10.0090158Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:10.0090254Z #define __code_model_small__ 1 2025-05-07T20:27:10.0090346Z #define _PSTL_CONFIG_H 2025-05-07T20:27:10.0090447Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:10.0090563Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:10.0090662Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:10.0090765Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:10.0091118Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:10.0091211Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:10.0091300Z #define le64toh(x) (x) 2025-05-07T20:27:10.0091398Z #define FILENAME_MAX 4096 2025-05-07T20:27:10.0091552Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:10.0091665Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:10.0091754Z #define L_cuserid 9 2025-05-07T20:27:10.0091843Z #define __ino_t_defined 2025-05-07T20:27:10.0091924Z #define __k8__ 1 2025-05-07T20:27:10.0092027Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:10.0092136Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:10.0092225Z #define __int8_t_defined 2025-05-07T20:27:10.0092318Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:10.0092419Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0092537Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:10.0092637Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:10.0092760Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:10.0092918Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:10.0093003Z #define __HAVE_COLUMN 2025-05-07T20:27:10.0093096Z #define __stub_fdetach 2025-05-07T20:27:10.0093522Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:10.0093604Z #define __pic__ 2 2025-05-07T20:27:10.0093723Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0093824Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:10.0093917Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:10.0094024Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:10.0094112Z #define __stub_chflags 2025-05-07T20:27:10.0094201Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:10.0094288Z #define __need_IOV_MAX 2025-05-07T20:27:10.0094397Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:10.0094500Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:10.0094680Z #define __cpp_decltype 200707L 2025-05-07T20:27:10.0094781Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:10.0094873Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:10.0095058Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:10.0095145Z #define TTY_NAME_MAX 32 2025-05-07T20:27:10.0095315Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:10.0095436Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0095606Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:10.0095718Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:10.0095810Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:10.0095901Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:10.0095986Z #define __import__ 2025-05-07T20:27:10.0096074Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:10.0096207Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:10.0096295Z #define __export__ 2025-05-07T20:27:10.0096417Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:10.0096516Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:10.0096682Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:10.0096781Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:10.0096875Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:10.0096969Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:10.0097060Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:10.0097183Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:10.0097298Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:10.0097403Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:10.0097497Z #define WNOWAIT 0x01000000 2025-05-07T20:27:10.0097578Z #define PLOSS 6 2025-05-07T20:27:10.0097668Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:10.0097939Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:10.0098025Z #define EXIT_SUCCESS 0 2025-05-07T20:27:10.0098131Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:10.0098226Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:10.0098325Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:10.0098419Z #define __thread__ __thread 2025-05-07T20:27:10.0098520Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:10.0098611Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:10.0098717Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:10.0098946Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:10.0099058Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:10.0099155Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:10.0099235Z #define __linux__ 1 2025-05-07T20:27:10.0099330Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:10.0099459Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:10.0099550Z #define __S16_TYPE short int 2025-05-07T20:27:10.0099908Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:10.0100021Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:10.0100214Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:10.0100320Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:10.0100417Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:10.0100496Z #define _T_SIZE_ 2025-05-07T20:27:10.0100598Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:10.0100717Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:10.0100809Z #define _PSTL_VERSION 12000 2025-05-07T20:27:10.0100932Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:10.0101024Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:10.0101123Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:10.0101252Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:10.0101337Z #define _IOS_INPUT 1 2025-05-07T20:27:10.0101432Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:10.0101536Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:10.0101716Z #define __INT64_TYPE__ long int 2025-05-07T20:27:10.0101817Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:10.0101915Z #define __shared__ __location__(shared) 2025-05-07T20:27:10.0102007Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:10.0102240Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:10.0102327Z #define __gid_t_defined 2025-05-07T20:27:10.0102442Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:10.0102538Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:10.0102739Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:10.0102839Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:10.0102928Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:10.0103013Z #define ___int_size_t_h 2025-05-07T20:27:10.0103123Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0103245Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:10.0103402Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:10.0103515Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:10.0103610Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:10.0103707Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:10.0103805Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:10.0103934Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0104052Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:10.0104172Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:10.0104263Z #define __clock_t_defined 1 2025-05-07T20:27:10.0104365Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:10.0104475Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:10.0104563Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:10.0104657Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:10.0104753Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:10.0104861Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:10.0104958Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:10.0105134Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:10.0105220Z #define __SSE__ 1 2025-05-07T20:27:10.0105314Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:10.0105410Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:10.0105500Z #define _CTYPE_H 1 2025-05-07T20:27:10.0105589Z #define __sigset_t_defined 2025-05-07T20:27:10.0105694Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:10.0105837Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:10.0105958Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:10.0106079Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:10.0106176Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:10.0106259Z #define __SM_70_RT_H__ 2025-05-07T20:27:10.0106354Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:10.0106461Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:10.0106556Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:10.0106721Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:10.0106815Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:10.0106928Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:10.0107027Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:10.0107118Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:10.0107198Z #define __amd64__ 1 2025-05-07T20:27:10.0107294Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:10.0107396Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:10.0107667Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:10.0107775Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:10.0107856Z #define EOF (-1) 2025-05-07T20:27:10.0107953Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:10.0108048Z #define __USE_POSIX199309 1 2025-05-07T20:27:10.0108142Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:10.0108238Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:10.0108331Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:10.0108425Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:10.0108540Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:10.0108725Z #define ____mbstate_t_defined 1 2025-05-07T20:27:10.0108821Z #define STA_NANO 0x2000 2025-05-07T20:27:10.0108919Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:10.0109013Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:10.0109204Z #define _IO_LINKED 0x80 2025-05-07T20:27:10.0109304Z #define __cpp_lib_launder 201606 2025-05-07T20:27:10.0109393Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:10.0109492Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:10.0109589Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:10.0109681Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:10.0109828Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:10.0109936Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0110038Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:10.0110142Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:10.0110235Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:10.0110324Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:10.0110461Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:10.0110589Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:10.0110793Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:10.0110987Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:10.0111070Z #define __stub_stty 2025-05-07T20:27:10.0111241Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:10.0111326Z #define le16toh(x) (x) 2025-05-07T20:27:10.0111431Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:10.0111609Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:10.0111689Z #define _SIZET_ 2025-05-07T20:27:10.0111780Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:10.0111867Z #define _SVID_SOURCE 1 2025-05-07T20:27:10.0111946Z #define _LP64 1 2025-05-07T20:27:10.0112033Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:10.0112273Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:10.0112389Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:10.0112478Z #define __UINT8_C(c) c 2025-05-07T20:27:10.0112572Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:10.0112671Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:10.0112783Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:10.0112876Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:10.0112967Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:10.0113067Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:10.0113157Z #define CUDARTAPI 2025-05-07T20:27:10.0113241Z #define IOV_MAX 1024 2025-05-07T20:27:10.0113688Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:10.0113787Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:10.0113886Z #define P_tmpdir "/tmp" 2025-05-07T20:27:10.0113992Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:10.0114074Z #define __wchar_t__ 2025-05-07T20:27:10.0114180Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:10.0114262Z #define SEEK_END 2 2025-05-07T20:27:10.0114360Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:10.0114539Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:10.0114639Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:10.0114789Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:10.0114882Z #define ____FILE_defined 1 2025-05-07T20:27:10.0114997Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:10.0115095Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:10.0115184Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:10.0115280Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:10.0115530Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:10.0115664Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:10.0115747Z #define _IO_RIGHT 04 2025-05-07T20:27:10.0115843Z #define __END_NAMESPACE_STD 2025-05-07T20:27:10.0116030Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:10.0116273Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:10.0116405Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:10.0116503Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:10.0116609Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:10.0116814Z #define _STDDEF_H_ 2025-05-07T20:27:10.0117009Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:10.0117112Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0117238Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:10.0117462Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:10.0117584Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:10.0117738Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:10.0117868Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:10.0117978Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:10.0118091Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:10.0118196Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:10.0118320Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:10.0118422Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:10.0118520Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:10.0118627Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:10.0118820Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:10.0118918Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:10.0119118Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:10.0119221Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:10.0119323Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:10.0119479Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:10.0119577Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:10.0119679Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:10.0119784Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:10.0119911Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:10.0120019Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:10.0120194Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:10.0120363Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:10.0120539Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:10.0120636Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:10.0120760Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:10.0120872Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:10.0120972Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:10.0121207Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:10.0121304Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:10.0121417Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:10.0121514Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:10.0121601Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:10.0121698Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:10.0121793Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:10.0121891Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:10.0121975Z #define __FXSR__ 1 2025-05-07T20:27:10.0122055Z #define _SIZE_T 2025-05-07T20:27:10.0122158Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:10.0122276Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:10.0122444Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:10.0122592Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:10.0122687Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:10.0122786Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:10.0122971Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:10.0123175Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:10.0123264Z #define _GXX_NULLPTR_T 2025-05-07T20:27:10.0123390Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:10.0123473Z #define FOPEN_MAX 16 2025-05-07T20:27:10.0123562Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:10.0123769Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:10.0123868Z #define __suseconds_t_defined 2025-05-07T20:27:10.0123954Z #define __off_t_defined 2025-05-07T20:27:10.0124117Z #define stderr stderr 2025-05-07T20:27:10.0124210Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:10.0124323Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:10.0124425Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:10.0124515Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:10.0124935Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:10.0125031Z #define __mode_t_defined 2025-05-07T20:27:10.0125116Z #define _GCC_SIZE_T 2025-05-07T20:27:10.0125212Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0125313Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:10.0125426Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:10.0125524Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:10.0125616Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:10.0125724Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:10.0125828Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:10.0125938Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:10.0126028Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:10.0126107Z #define __size_t__ 2025-05-07T20:27:10.0126237Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:10.0126341Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:10.0126450Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:10.0126604Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:10.0126697Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:10.0126865Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:10.0126951Z #define _ENDIAN_H 1 2025-05-07T20:27:10.0127055Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:10.0127150Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:10.0127258Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:10.0127337Z #define __try try 2025-05-07T20:27:10.0127431Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:10.0127528Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:10.0127621Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:10.0127887Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:10.0127977Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:10.0128056Z #define __PIC__ 2 2025-05-07T20:27:10.0128174Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:10.0128298Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:10.0128429Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:10.0128530Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:10.0128622Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:10.0128806Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:10.0128912Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:10.0129015Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:10.0129113Z #define _IO_uid_t __uid_t 2025-05-07T20:27:10.0129211Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:10.0129337Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:10.0129439Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:10.0129583Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:10.0129685Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:10.0129808Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:10.0129890Z #define LONG_BIT 64 2025-05-07T20:27:10.0129998Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:10.0130101Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:10.0130228Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:10.0130321Z #define __fsfilcnt_t_defined 2025-05-07T20:27:10.0130414Z #define __blkcnt_t_defined 2025-05-07T20:27:10.0130687Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:10.0130863Z #define __USE_LARGEFILE 1 2025-05-07T20:27:10.0130963Z #define __cpp_constexpr 201603L 2025-05-07T20:27:10.0131057Z #define CUDART_VERSION 12080 2025-05-07T20:27:10.0131149Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:10.0131319Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:10.0131407Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:10.0131612Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:10.0131703Z #define __lldiv_t_defined 1 2025-05-07T20:27:10.0131783Z #define __SSE2__ 1 2025-05-07T20:27:10.0131867Z #define _IOLBF 1 2025-05-07T20:27:10.0131972Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:10.0132067Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:10.0132175Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:10.0132269Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:10.0132383Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:10.0132473Z #define __INT32_TYPE__ int 2025-05-07T20:27:10.0132561Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:10.0132675Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:10.0132774Z #define __cpp_exceptions 199711L 2025-05-07T20:27:10.0132869Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:10.0132986Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:10.0133076Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:10.0133191Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:10.0133354Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:10.0133451Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:10.0133548Z #define __SWORD_TYPE long int 2025-05-07T20:27:10.0133639Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:10.0133734Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:10.0133831Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:10.0133923Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:10.0134206Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:10.0134303Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:10.0134454Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:10.0134533Z #define _T_SIZE 2025-05-07T20:27:10.0134646Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:10.0134793Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:10.0134971Z #define __va_arg_pack() __builtin_va_arg_pack () 2025-05-07T20:27:10.0135082Z #define _POSIX_TIMER_MAX 32 2025-05-07T20:27:10.0135173Z #define _GLIBCXX_HAVE_TLS 1 2025-05-07T20:27:10.0135299Z #define _GLIBCXX_NOTHROW _GLIBCXX_USE_NOEXCEPT 2025-05-07T20:27:10.0135397Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0135487Z #define __ATOMIC_CONSUME 1 2025-05-07T20:27:10.0135667Z #define __CUDA_ARCH_HAS_FEATURE__(_FEAT) __CUDA_ARCH_FEAT_ ##_FEAT 2025-05-07T20:27:10.0135755Z #define __GNUC_MINOR__ 4 2025-05-07T20:27:10.0135860Z #define __GLIBCXX_TYPE_INT_N_0 __int128 2025-05-07T20:27:10.0135954Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:27:10.0136071Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0136158Z #define __PIE__ 2 2025-05-07T20:27:10.0136265Z #define LITTLE_ENDIAN __LITTLE_ENDIAN 2025-05-07T20:27:10.0136363Z #define _GLIBCXX_HAVE_INT64_T_LONG 1 2025-05-07T20:27:10.0136557Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:27:10.0136786Z #define __intN_t(N,MODE) typedef int int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:10.0136878Z #define __nlink_t_defined 2025-05-07T20:27:10.0137007Z #define _GLIBCXX17_DEPRECATED [[__deprecated__]] 2025-05-07T20:27:10.0137120Z #define _PSTL_STRING(x) _PSTL_STRING_AUX(x) 2025-05-07T20:27:10.0137205Z #define _XOPEN_LIM_H 1 2025-05-07T20:27:10.0137472Z #define __u_intN_t(N,MODE) typedef unsigned int u_int ##N ##_t __attribute__ ((__mode__ (MODE))) 2025-05-07T20:27:10.0137588Z #define __cpp_template_template_args 201611L 2025-05-07T20:27:10.0137694Z #define _GTHREAD_USE_MUTEX_TIMEDLOCK 1 2025-05-07T20:27:10.0137795Z #define BC_DIM_MAX _POSIX2_BC_DIM_MAX 2025-05-07T20:27:10.0137887Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:27:10.0138093Z #define __FILE_defined 1 2025-05-07T20:27:10.0138274Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:27:10.0138371Z #define _GLIBCXX_HAVE_SINCOS 1 2025-05-07T20:27:10.0138540Z #define __USE_XOPEN_EXTENDED 1 2025-05-07T20:27:10.0138646Z #define __cpp_lib_tuple_element_t 201402L 2025-05-07T20:27:10.0138762Z #define isascii_l(c,l) __isascii_l ((c), (l)) 2025-05-07T20:27:10.0138874Z #define cudaInvalidDeviceId ((int)-2) 2025-05-07T20:27:10.0138976Z #define _GLIBCXX_HAVE_SYS_RESOURCE_H 1 2025-05-07T20:27:10.0139058Z #define __INT16_C(c) c 2025-05-07T20:27:10.0139157Z #define __U32_TYPE unsigned int 2025-05-07T20:27:10.0139255Z #define _GLIBCXX_HAVE_SYS_IOCTL_H 1 2025-05-07T20:27:10.0139379Z #define FD_CLR(fd,fdsetp) __FD_CLR (fd, fdsetp) 2025-05-07T20:27:10.0139460Z #define __STDC__ 1 2025-05-07T20:27:10.0139556Z #define _GLIBCXX_HAVE_VWSCANF 1 2025-05-07T20:27:10.0139656Z #define _GLIBCXX_HAVE_EXECINFO_H 1 2025-05-07T20:27:10.0139757Z #define _GLIBCXX_USE_REALPATH 1 2025-05-07T20:27:10.0139911Z #define __attribute_malloc__ __attribute__ ((__malloc__)) 2025-05-07T20:27:10.0140004Z #define __FLT32X_DIG__ 15 2025-05-07T20:27:10.0140103Z #define _GLIBCXX_USE_C99_CTYPE_TR1 1 2025-05-07T20:27:10.0140206Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:27:10.0140321Z #define cudaArrayDeferredMapping 0x80 2025-05-07T20:27:10.0140431Z #define _GLIBCXX_END_NAMESPACE_CONTAINER 2025-05-07T20:27:10.0140528Z #define USHRT_MAX (SHRT_MAX * 2 + 1) 2025-05-07T20:27:10.0140632Z #define __cpp_lib_is_swappable 201603 2025-05-07T20:27:10.0140714Z #define stdin stdin 2025-05-07T20:27:10.0140806Z #define __ino64_t_defined 2025-05-07T20:27:10.0140891Z #define STA_CLK 0x8000 
2025-05-07T20:27:10.0140983Z #define __clockid_t_defined 1 2025-05-07T20:27:10.0141136Z #define _GLIBCXX_NOEXCEPT_IF(...) noexcept(__VA_ARGS__) 2025-05-07T20:27:10.0141301Z #define __attribute_noinline__ __attribute__ ((__noinline__)) 2025-05-07T20:27:10.0141404Z #define __cudaCDP2MemsetAsync 2025-05-07T20:27:10.0141514Z #define _PSTL_PRAGMA_SIMD_SCAN(PRM) 2025-05-07T20:27:10.0141619Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL 2025-05-07T20:27:10.0141722Z #define _GLIBCXX_TR1_POLY_HERMITE_TCC 1 2025-05-07T20:27:10.0141933Z #define __FD_SET(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] |= __FD_MASK (d))) 2025-05-07T20:27:10.0142023Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:27:10.0142562Z #define __tobody(c,f,a,args) (__extension__ ({ int __res; if (sizeof (c) > 1) { if (__builtin_constant_p (c)) { int __c = (c); __res = __c < -128 || __c > 255 ? __c : (a)[__c]; } else __res = f args; } else __res = (a)[(int) (c)]; __res; })) 2025-05-07T20:27:10.0142645Z #define DOMAIN 1 2025-05-07T20:27:10.0142736Z #define M_LN2 0.69314718055994530942 2025-05-07T20:27:10.0142821Z #define __NVCC__ 1 2025-05-07T20:27:10.0142922Z #define __cudaCDP2Memset2DAsync 2025-05-07T20:27:10.0143031Z #define __CLOCK_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:10.0143137Z #define _PSTL_PRAGMA_SIMD_EARLYEXIT 2025-05-07T20:27:10.0143241Z #define __throw_exception_again throw 2025-05-07T20:27:10.0143332Z #define M_SQRT2 1.41421356237309504880 2025-05-07T20:27:10.0143424Z #define __EXCEPTION_H 1 2025-05-07T20:27:10.0143518Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:27:10.0143628Z #define HUGE_VAL (__builtin_huge_val()) 2025-05-07T20:27:10.0143935Z #define cudaStreamAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:10.0144047Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:27:10.0144147Z #define _GLIBCXX_INLINE_VERSION 0 2025-05-07T20:27:10.0144240Z #define _GLIBCXX_USE_INT128 1 2025-05-07T20:27:10.0144340Z #define __cpp_lib_bool_constant 201505 2025-05-07T20:27:10.0144440Z #define PTHREAD_KEYS_MAX 1024 2025-05-07T20:27:10.0144580Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:27:10.0144685Z #define __FSFILCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:10.0144798Z #define _GLIBCXX_DOUBLE_IS_IEEE_BINARY64 1 2025-05-07T20:27:10.0144890Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:27:10.0145088Z #define __cpp_lib_tuples_by_type 201304 2025-05-07T20:27:10.0145185Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:27:10.0145287Z #define __cpp_generic_lambdas 201304L 2025-05-07T20:27:10.0145500Z #define _GLIBCXX_THROW_OR_ABORT(_EXC) (throw (_EXC)) 2025-05-07T20:27:10.0145595Z #define __useconds_t_defined 2025-05-07T20:27:10.0145693Z #define _GLIBCXX_USE_SCHED_YIELD 1 2025-05-07T20:27:10.0145879Z #define __attribute_deprecated__ __attribute__ ((__deprecated__)) 2025-05-07T20:27:10.0146030Z #define __cpp_lib_type_trait_variable_templates 201510L 2025-05-07T20:27:10.0146114Z #define __SSE_MATH__ 1 2025-05-07T20:27:10.0146208Z #define _IO_wint_t wint_t 2025-05-07T20:27:10.0146302Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:27:10.0146399Z #define _GLIBCXX_VERBOSE 1 2025-05-07T20:27:10.0146493Z #define _GLIBCXX_HAVE_ASINF 1 2025-05-07T20:27:10.0146607Z #define __cpp_user_defined_literals 200809L 2025-05-07T20:27:10.0146707Z #define _GLIBCXX_HAVE_ISINFL 1 2025-05-07T20:27:10.0146803Z #define _GLIBCXX_HAVE_ASINL 1 2025-05-07T20:27:10.0146886Z #define __USE_ATFILE 1 2025-05-07T20:27:10.0146981Z #define _POSIX_OPEN_MAX 20 2025-05-07T20:27:10.0147076Z #define 
_POSIX_LOGIN_NAME_MAX 9 2025-05-07T20:27:10.0147167Z #define _GCC_PTRDIFF_T 2025-05-07T20:27:10.0147401Z #define cudaKernelNodeAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:10.0147498Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:27:10.0147595Z #define _POSIX_THREAD_KEYS_MAX 128 2025-05-07T20:27:10.0147700Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:27:10.0147807Z #define __cpp_lib_array_constexpr 201803L 2025-05-07T20:27:10.0147893Z #define _STDLIB_H 1 2025-05-07T20:27:10.0148032Z #define __exctype(name) extern int name (int) __THROW 2025-05-07T20:27:10.0148127Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:27:10.0148225Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:27:10.0148352Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:10.0148461Z #define __SURFACE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:10.0148564Z #define __SM_61_INTRINSICS_H__ 2025-05-07T20:27:10.0148749Z #define _GLIBCXX_PACKAGE_STRING "package-unused version-unused" 2025-05-07T20:27:10.0148903Z #define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l)) 2025-05-07T20:27:10.0149017Z #define __glibcxx_requires_nonempty() 2025-05-07T20:27:10.0149132Z #define w_stopsig __wait_stopped.__w_stopsig 2025-05-07T20:27:10.0149228Z #define __ldiv_t_defined 1 2025-05-07T20:27:10.0149411Z #define __glibcxx_requires_irreflexive_pred(_First,_Last,_Pred) 2025-05-07T20:27:10.0149503Z #define ___int_ptrdiff_t_h 2025-05-07T20:27:10.0149675Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:10.0149778Z #define __cudaCDP2EventDestroy 2025-05-07T20:27:10.0149869Z #define __HOST_DEFINES_H__ 2025-05-07T20:27:10.0149973Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:27:10.0150074Z #define __SM_20_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:10.0150173Z #define _GLIBCXX_USE_NANOSLEEP 1 2025-05-07T20:27:10.0150263Z #define CUDART_CB 2025-05-07T20:27:10.0150364Z #define BC_BASE_MAX _POSIX2_BC_BASE_MAX 2025-05-07T20:27:10.0150488Z #define _GLIBCXX_USE_C99_INTTYPES_WCHAR_T_TR1 1 2025-05-07T20:27:10.0150582Z #define MB_LEN_MAX 16 2025-05-07T20:27:10.0150807Z #define __glibcxx_requires_partitioned_lower_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:10.0150912Z #define _GLIBCXX11_USE_C99_WCHAR 1 2025-05-07T20:27:10.0151036Z #define _IO_peekc(_fp) _IO_peekc_unlocked (_fp) 2025-05-07T20:27:10.0151148Z #define _GLIBCXX_HAVE_AS_SYMVER_DIRECTIVE 1 2025-05-07T20:27:10.0151246Z #define _GLIBCXX_HAVE_UNISTD_H 1 2025-05-07T20:27:10.0151420Z #define __glibc_likely(cond) __builtin_expect((cond), 1) 2025-05-07T20:27:10.0151542Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:27:10.0151636Z #define _GNU_SOURCE 1 2025-05-07T20:27:10.0151721Z #define __stub_putmsg 2025-05-07T20:27:10.0151804Z #define __CUDACC__ 1 2025-05-07T20:27:10.0151895Z #define __N(msgid) (msgid) 2025-05-07T20:27:10.0151978Z #define __P(args) args 2025-05-07T20:27:10.0152320Z #define cudaKernelNodeAttributeCooperative cudaLaunchAttributeCooperative 2025-05-07T20:27:10.0152422Z #define __cpp_init_captures 201304L 2025-05-07T20:27:10.0152527Z #define _GLIBCXX17_CONSTEXPR constexpr 2025-05-07T20:27:10.0152692Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:27:10.0152788Z #define __cpp_lib_as_const 201510 2025-05-07T20:27:10.0152870Z #define __WCHAR_T 2025-05-07T20:27:10.0152964Z #define __ATOMIC_RELEASE 3 2025-05-07T20:27:10.0153056Z #define __fsblkcnt_t_defined 2025-05-07T20:27:10.0153171Z #define __cudaCDP2EventCreateWithFlags 2025-05-07T20:27:10.0153277Z #define __DEVICE_DOUBLE_FUNCTIONS_H__ 
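A dump of this shape can be reproduced by hand outside the job. The sketch below is a minimal, assumed-equivalent command, not the exact helper from .github/scripts/setup_env.bash (which is not shown in this log); the probe file path is illustrative:

  # Preprocess a trivial CUDA source (-E) and ask the host compiler to print
  # every macro in effect (-Xcompiler -dM): this yields the glibc/libstdc++
  # defines together with CUDA runtime ones such as __CUDACC_VER_MAJOR__.
  # Whether device-only macros like __CUDA_ARCH__ appear depends on which
  # nvcc preprocessing trajectory emits the dump.
  printf '#include <cuda_runtime.h>\n' > /tmp/macro_probe.cu   # hypothetical path
  conda run -n build_binary nvcc -E -Xcompiler -dM /tmp/macro_probe.cu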
2025-05-07T20:27:10.0262970Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:11.9201286Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:11.9201673Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:11.9202012Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:11.9202411Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:11.9202762Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:11.9836686Z /usr/bin/nvidia-smi
2025-05-07T20:27:11.9842503Z + nvidia-smi
2025-05-07T20:27:12.0018030Z Wed May  7 20:27:11 2025
2025-05-07T20:27:12.0018610Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0019227Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:12.0019740Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:12.0020255Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:12.0020808Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:12.0021283Z |                                         |                        |               MIG M. |
2025-05-07T20:27:12.0021636Z |=========================================+========================+======================|
2025-05-07T20:27:12.0189562Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:12.0190047Z |  0%   30C    P8             26W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:12.0190446Z |                                         |                        |                  N/A |
2025-05-07T20:27:12.0190867Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:12.0194871Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.0195327Z | Processes:                                                                              |
2025-05-07T20:27:12.0195784Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:12.0196205Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:12.0196579Z |=========================================================================================|
2025-05-07T20:27:12.0199342Z |  No running processes found                                                             |
2025-05-07T20:27:12.0199833Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:12.2608687Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:12.2666582Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:12.2667151Z . 
$PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0 2025-05-07T20:27:12.2680052Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:27:12.2680504Z env: 2025-05-07T20:27:12.2680742Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:27:12.2681058Z BUILD_ENV: build_binary 2025-05-07T20:27:12.2681509Z BUILD_TARGET: genai 2025-05-07T20:27:12.2681746Z BUILD_VARIANT: cuda 2025-05-07T20:27:12.2682018Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:27:12.2682299Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:27:12.2682613Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:27:12.2682963Z ##[endgroup] 2025-05-07T20:27:12.6051728Z ################################################################################ 2025-05-07T20:27:12.6052237Z # Install PyTorch (PIP) 2025-05-07T20:27:12.6052564Z # 2025-05-07T20:27:12.6066840Z # [2025-05-07T20:27:12.606Z] + install_pytorch_pip build_binary nightly cuda/12.8.0 2025-05-07T20:27:12.6067487Z ################################################################################ 2025-05-07T20:27:12.6067772Z 2025-05-07T20:27:12.6095500Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy 2025-05-07T20:27:13.6115230Z Channels: 2025-05-07T20:27:13.6115485Z - conda-forge 2025-05-07T20:27:13.6115730Z Platform: linux-64 2025-05-07T20:27:16.9660952Z Collecting package metadata (repodata.json): - \ | / done 2025-05-07T20:27:17.6959149Z Solving environment: \ | / done 2025-05-07T20:27:17.9195914Z 2025-05-07T20:27:17.9196339Z ## Package Plan ## 2025-05-07T20:27:17.9196775Z 2025-05-07T20:27:17.9197404Z environment location: /home/ec2-user/miniconda/envs/build_binary 2025-05-07T20:27:17.9198279Z 2025-05-07T20:27:17.9198478Z added / updated specs: 2025-05-07T20:27:17.9199153Z - numpy 2025-05-07T20:27:17.9199483Z 2025-05-07T20:27:17.9199529Z 2025-05-07T20:27:17.9199877Z The following packages will be downloaded: 2025-05-07T20:27:17.9200643Z 2025-05-07T20:27:17.9200965Z package | build 2025-05-07T20:27:17.9201713Z ---------------------------|----------------- 2025-05-07T20:27:17.9202512Z libblas-3.9.0 |31_h59b9bed_openblas 16 KB conda-forge 2025-05-07T20:27:17.9203447Z libcblas-3.9.0 |31_he106b2a_openblas 16 KB conda-forge 2025-05-07T20:27:17.9203975Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T20:27:17.9204443Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T20:27:17.9204918Z liblapack-3.9.0 |31_h7ac8fdf_openblas 16 KB conda-forge 2025-05-07T20:27:17.9205577Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T20:27:17.9206205Z numpy-2.2.5 | py313h17eae1a_0 8.1 MB conda-forge 2025-05-07T20:27:17.9206611Z ------------------------------------------------------------ 2025-05-07T20:27:17.9206970Z Total: 15.4 MB 2025-05-07T20:27:17.9207188Z 2025-05-07T20:27:17.9207320Z The following NEW packages will be INSTALLED: 2025-05-07T20:27:17.9207554Z 2025-05-07T20:27:17.9207779Z libblas conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas 2025-05-07T20:27:17.9208347Z libcblas conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas 2025-05-07T20:27:17.9209084Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T20:27:17.9209675Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T20:27:17.9210211Z liblapack conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas 2025-05-07T20:27:17.9210775Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T20:27:17.9211613Z numpy 
conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:17.9212064Z Downloading and Extracting Packages: ...working... [ ... per-package download progress bars and terminal control sequences elided; all seven packages reached 100% ... ] done
2025-05-07T20:27:18.7927220Z Preparing transaction: done
2025-05-07T20:27:18.9933289Z Verifying transaction: done
2025-05-07T20:27:19.0942250Z Executing transaction: done
2025-05-07T20:27:19.2758515Z ################################################################################
2025-05-07T20:27:19.2758892Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:19.2759203Z #
2025-05-07T20:27:19.2776173Z # [2025-05-07T20:27:19.277Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:19.2776838Z ################################################################################
2025-05-07T20:27:19.2792213Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:19.3762604Z [CHECK] Network does not appear to be blocked. 
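The step below resolves the requested (channel, CUDA version) pair into a concrete pip index URL and package spec. A minimal sketch of that mapping, using only the naming scheme visible in this log (nightly channel, 12.8.0 -> cu128), plus a follow-up check mirroring the variant and _GLIBCXX_USE_CXX11_ABI verifications further down:

    # Derive the PyTorch nightly index URL from the CUDA version (assumed scheme).
    cuda_version="12.8.0"
    variant="cu$(echo "${cuda_version}" | cut -d. -f1-2 | tr -d .)"   # 12.8.0 -> cu128
    index_url="https://download.pytorch.org/whl/nightly/${variant}/"
    conda run -n build_binary pip install --pre torch --index-url "${index_url}"
    # Confirm the installed wheel is the right variant and C++11 ABI setting.
    conda run -n build_binary python -c "import torch; print(torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi())"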
2025-05-07T20:27:19.3763113Z ################################################################################ 2025-05-07T20:27:19.3763472Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:19.3763757Z # 2025-05-07T20:27:19.3780075Z # [2025-05-07T20:27:19.377Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:19.3780880Z ################################################################################ 2025-05-07T20:27:19.3781114Z 2025-05-07T20:27:19.3801759Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:19.3828220Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:19.3844976Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:19.3845540Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:19.3852893Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:19.3860282Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:19.3881390Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:32.0685012Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:28:32.0687474Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:28:32.0688062Z Collecting torch 2025-05-07T20:28:32.0688853Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:28:32.0689586Z Collecting filelock (from torch) 2025-05-07T20:28:32.0689832Z 2025-05-07T20:28:32.0690163Z Using cached https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:28:32.0691133Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:28:32.0692244Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:28:32.0692930Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:28:32.0693448Z Using cached https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:28:32.0693977Z Collecting networkx (from torch) 2025-05-07T20:28:32.0694500Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:28:32.0695522Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 19.2 MB/s eta 0:00:00 2025-05-07T20:28:32.0695906Z Collecting jinja2 (from torch) 2025-05-07T20:28:32.0696407Z Using cached https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:28:32.0696927Z Collecting fsspec (from torch) 2025-05-07T20:28:32.0697448Z Using cached https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T20:28:32.0698051Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 
2025-05-07T20:28:32.0698903Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0699764Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:28:32.0700631Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0701494Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:28:32.0702334Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0703155Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:28:32.0704243Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:28:32.0704990Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:28:32.0705736Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0706477Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:28:32.0707470Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0708296Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:28:32.0709030Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0709784Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:28:32.0710557Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0711317Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:28:32.0712156Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0713001Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:28:32.0714031Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:28:32.0714764Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:28:32.0715619Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:28:32.0716429Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:28:32.0717226Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0718031Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:28:32.0718873Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:28:32.0719711Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T20:28:32.0720714Z Using cached 
https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:28:32.0721559Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:28:32.0722430Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:28:32.0723284Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:28:32.0723862Z Using cached https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:28:32.0724414Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:28:32.0724937Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:28:32.0725450Z Preparing metadata (setup.py): started 2025-05-07T20:28:32.0725849Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:28:32.0726632Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1047.0 MB) 2025-05-07T20:28:32.0727502Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 23.8 MB/s eta 0:00:00 2025-05-07T20:28:32.0728385Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:28:32.0729499Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:28:32.0730691Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:28:32.0731890Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:28:32.0733117Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:28:32.0734208Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:28:32.0735382Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:28:32.0736468Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T20:28:32.0737493Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:28:32.0738608Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:28:32.0739728Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:28:32.0740821Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:28:32.0742000Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 
MB) 2025-05-07T20:28:32.0743164Z Using cached https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:28:32.0744342Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:28:32.0745313Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 196.0 MB/s eta 0:00:00 2025-05-07T20:28:32.0745719Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:28:32.0746119Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:28:32.0746578Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:28:32.0747475Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=8642341f746950f07f790b09c3e552393bd8cdf535cdc73dd539cf084cd476d7 2025-05-07T20:28:32.0748533Z Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:28:32.0749138Z Successfully built MarkupSafe 2025-05-07T20:28:32.0750874Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:28:32.0752516Z 2025-05-07T20:28:32.0754639Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:28:32.0756758Z 2025-05-07T20:28:34.2952942Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:28:34.2957288Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:28:37.7141467Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:28:41.1478655Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:28:41.1479110Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:28:44.4966534Z True 2025-05-07T20:28:44.4966787Z True 2025-05-07T20:28:44.4966897Z 2025-05-07T20:28:44.5593393Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:28:44.5633219Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:44.5633858Z if . 
$PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:28:44.5647563Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:44.5647932Z env: 2025-05-07T20:28:44.5648164Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:44.5648475Z BUILD_ENV: build_binary 2025-05-07T20:28:44.5648734Z BUILD_TARGET: genai 2025-05-07T20:28:44.5649033Z BUILD_VARIANT: cuda 2025-05-07T20:28:44.5649289Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:44.5649553Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:44.5649868Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:44.5650216Z ##[endgroup] 2025-05-07T20:28:44.8981962Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:28:44.8983690Z ################################################################################ 2025-05-07T20:28:44.8984194Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:28:44.8984567Z # 2025-05-07T20:28:44.9000398Z # [2025-05-07T20:28:44.899Z] + collect_pytorch_env_info build_binary 2025-05-07T20:28:44.9000820Z ################################################################################ 2025-05-07T20:28:44.9001044Z 2025-05-07T20:28:44.9018390Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:44.9944436Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:44.9953964Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:28:44.9954634Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:28:44.9955039Z 2025-05-07T20:28:45.0821301Z 2025-05-07T20:28:45.0821733Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:28:45.0845255Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:28:50.9642005Z Collecting environment information... 
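As an aside, the same report can be generated without downloading the script, since collect_env ships inside torch itself; a one-line sketch:

    # torch bundles the environment-report tool as a runnable module.
    conda run -n build_binary python -m torch.utils.collect_env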
2025-05-07T20:28:50.9642398Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:28:50.9642696Z Is debug build: False 2025-05-07T20:28:50.9642957Z CUDA used to build PyTorch: 12.8 2025-05-07T20:28:50.9643249Z ROCM used to build PyTorch: N/A 2025-05-07T20:28:50.9643432Z 2025-05-07T20:28:50.9643538Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:28:50.9643873Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:28:50.9644207Z Clang version: Could not collect 2025-05-07T20:28:50.9644506Z CMake version: Could not collect 2025-05-07T20:28:50.9644780Z Libc version: glibc-2.34 2025-05-07T20:28:50.9644947Z 2025-05-07T20:28:50.9645260Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:28:50.9645939Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:28:50.9646362Z Is CUDA available: True 2025-05-07T20:28:50.9646623Z CUDA runtime version: 12.8.61 2025-05-07T20:28:50.9646905Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:28:50.9647225Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:28:50.9647562Z Nvidia driver version: 570.133.07 2025-05-07T20:28:50.9647853Z cuDNN version: Could not collect 2025-05-07T20:28:50.9648135Z HIP runtime version: N/A 2025-05-07T20:28:50.9648391Z MIOpen runtime version: N/A 2025-05-07T20:28:50.9648661Z Is XNNPACK available: True 2025-05-07T20:28:50.9648834Z 2025-05-07T20:28:50.9648914Z CPU: 2025-05-07T20:28:50.9649139Z Architecture: x86_64 2025-05-07T20:28:50.9649801Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:28:50.9650208Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:28:50.9650609Z Byte Order: Little Endian 2025-05-07T20:28:50.9650930Z CPU(s): 16 2025-05-07T20:28:50.9651241Z On-line CPU(s) list: 0-15 2025-05-07T20:28:50.9651776Z Vendor ID: AuthenticAMD 2025-05-07T20:28:50.9652127Z Model name: AMD EPYC 7R32 2025-05-07T20:28:50.9652459Z CPU family: 23 2025-05-07T20:28:50.9652758Z Model: 49 2025-05-07T20:28:50.9653057Z Thread(s) per core: 2 2025-05-07T20:28:50.9653355Z Core(s) per socket: 8 2025-05-07T20:28:50.9653649Z Socket(s): 1 2025-05-07T20:28:50.9653938Z Stepping: 0 2025-05-07T20:28:50.9654257Z BogoMIPS: 5599.99 2025-05-07T20:28:50.9656389Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:28:50.9658499Z Hypervisor vendor: KVM 2025-05-07T20:28:50.9658828Z Virtualization type: full 2025-05-07T20:28:50.9659179Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:28:50.9659553Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:28:50.9659935Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:28:50.9660307Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:28:50.9660633Z NUMA node(s): 1 2025-05-07T20:28:50.9660936Z NUMA node0 CPU(s): 0-15 2025-05-07T20:28:50.9661284Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:28:50.9661666Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:28:50.9662037Z Vulnerability L1tf: Not affected 2025-05-07T20:28:50.9662399Z Vulnerability 
Mds: Not affected 2025-05-07T20:28:50.9662763Z Vulnerability Meltdown: Not affected 2025-05-07T20:28:50.9663129Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:28:50.9663509Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:28:50.9664074Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:28:50.9664672Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:28:50.9665236Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:28:50.9665945Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:28:50.9666962Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:28:50.9667656Z Vulnerability Srbds: Not affected 2025-05-07T20:28:50.9668028Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:28:50.9668266Z 2025-05-07T20:28:50.9668377Z Versions of relevant libraries: 2025-05-07T20:28:50.9668650Z [pip3] numpy==2.2.5 2025-05-07T20:28:50.9668903Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:28:50.9669225Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:28:50.9669542Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:28:50.9669979Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:28:50.9670306Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:28:50.9670610Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:28:50.9670909Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:28:50.9671223Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:28:50.9671546Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:28:50.9671970Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:28:50.9672287Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:28:50.9672587Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:28:50.9672895Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:28:50.9673200Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:28:50.9673522Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:28:50.9673909Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9674418Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9674961Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9675504Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:28:50.9676055Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9676613Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:28:50.9677120Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9677616Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:28:50.9678119Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:50.9678664Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:28:50.9679162Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9679658Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:50.9680256Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9680744Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9681241Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:28:50.9681747Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:28:50.9682232Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:28:50.9682716Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9683209Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9683696Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:28:50.9684185Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9684676Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:28:50.9685178Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9685688Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:28:50.9686198Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9686711Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:28:50.9687223Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:28:50.9687737Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:28:50.9688221Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:28:50.9688708Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:28:50.9689323Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:50.9689845Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:50.9690361Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:28:50.9690972Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:28:50.9691569Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:28:50.9692063Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:28:50.9692571Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:28:50.9693082Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:28:50.9693603Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:28:50.9694099Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:28:50.9694707Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:28:50.9695208Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:28:50.9695704Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:28:50.9696186Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:28:50.9696472Z 2025-05-07T20:28:51.0374969Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:51.0375667Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:28:51.0387413Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:28:51.0387774Z env: 2025-05-07T20:28:51.0388011Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:28:51.0388323Z BUILD_ENV: build_binary 2025-05-07T20:28:51.0388583Z BUILD_TARGET: genai 2025-05-07T20:28:51.0388825Z BUILD_VARIANT: cuda 2025-05-07T20:28:51.0389092Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:28:51.0389358Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:28:51.0389672Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:28:51.0390019Z ##[endgroup] 2025-05-07T20:28:51.3750889Z ################################################################################ 2025-05-07T20:28:51.3751289Z # Prepare FBGEMM-GPU Build 2025-05-07T20:28:51.3766941Z # 2025-05-07T20:28:51.3767316Z # [2025-05-07T20:28:51.376Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:28:51.3767727Z ################################################################################ 2025-05-07T20:28:51.3767961Z 2025-05-07T20:28:51.3782844Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:51.4724544Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:51.4746589Z [BUILD] Running git submodules update ... 2025-05-07T20:28:51.4767928Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:28:51.5128986Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:28:51.5129494Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:28:51.5129940Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:28:51.5130346Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:28:51.5130765Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:28:51.5131207Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:28:51.5131630Z Synchronizing submodule url for '../external/json' 2025-05-07T20:28:51.5164858Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:28:51.5724511Z [BUILD] Installing other build dependencies ... 
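Each [EXEC] [ATTEMPT i/3] line in this log comes from a retry wrapper defined in .github/scripts/setup_env.bash; that implementation is not shown here, so the following is only a hypothetical bash reconstruction of the observed behavior:

    # Hypothetical retry wrapper behind the "[EXEC] [ATTEMPT i/3]" lines;
    # the real helper in setup_env.bash may differ (backoff interval is a guess).
    exec_with_retries () {
      local max_retries=3 attempt
      for attempt in $(seq 0 "${max_retries}"); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        if "$@"; then
          return 0
        fi
        sleep 2
      done
      return 1
    }

    exec_with_retries conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt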
2025-05-07T20:28:51.5746539Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:28:53.9678083Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:28:53.9688840Z Using cached backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:28:54.0096479Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:28:54.0105805Z Using cached build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:28:54.1419886Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:28:54.1430362Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:28:54.1868835Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:28:54.1877061Z Using cached click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:28:54.4161410Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:28:54.4171872Z Using cached hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:28:54.4257089Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:28:54.4260592Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:28:54.4688221Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:28:54.4696979Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:28:54.4710527Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:28:54.5035081Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:28:54.5044392Z Using cached pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:28:54.5908326Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:28:54.6080967Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:28:54.6939332Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:28:54.6948647Z Using cached scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:28:54.6998318Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:28:54.7457051Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:28:54.7466149Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:28:54.7821941Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:28:54.7831335Z Using cached tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:28:54.8173677Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:28:54.8183422Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:28:54.8892275Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:28:54.8922632Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:28:54.9633339Z Collecting pyproject_hooks (from build->-r 
requirements.txt (line 14)) 2025-05-07T20:28:54.9642361Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:28:54.9975152Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:54.9984023Z Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:55.0450315Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:28:55.0468245Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:28:55.0484560Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:28:55.1010902Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.1037234Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:28:55.1513286Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:28:55.1836281Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:28:55.1845264Z Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:28:55.1865133Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:28:55.2501281Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:28:55.2540860Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:28:55.2976531Z Using cached backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:28:55.2985455Z Using cached build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:28:55.2994654Z Using cached cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:28:55.3187265Z Using cached click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:28:55.3196394Z Using cached hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:28:55.3209027Z Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:28:55.3218232Z Using cached ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:28:55.3229471Z Using cached pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:28:55.3270333Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:28:55.4022013Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 7.2 MB/s eta 0:00:00 2025-05-07T20:28:55.4030664Z Using cached scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:28:55.4040265Z Using cached setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:28:55.4049341Z Using cached tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:28:55.4058820Z Using cached patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:28:55.4070257Z Using cached attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:28:55.4098585Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:28:55.4575936Z Using cached distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:28:55.4584816Z Using cached pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 
2025-05-07T20:28:55.4612317Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:28:55.5168113Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:28:55.6855347Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:28:58.1095109Z 2025-05-07T20:28:58.1149309Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:28:58.2830439Z ################################################################################ 2025-05-07T20:28:58.2830840Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:28:58.2831109Z # 2025-05-07T20:28:58.2848825Z # [2025-05-07T20:28:58.284Z] + install_triton_pip build_binary 2025-05-07T20:28:58.2849230Z ################################################################################ 2025-05-07T20:28:58.2849454Z 2025-05-07T20:28:58.2850053Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:28:58.2850511Z ################################################################################ 2025-05-07T20:28:58.2850895Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:28:58.2851229Z # 2025-05-07T20:28:58.2868912Z # [2025-05-07T20:28:58.286Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:58.2869458Z ################################################################################ 2025-05-07T20:28:58.2869695Z 2025-05-07T20:28:58.2886258Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:28:58.3840074Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:28:58.3840909Z ################################################################################ 2025-05-07T20:28:58.3841592Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:28:58.3842172Z # 2025-05-07T20:28:58.3858101Z # [2025-05-07T20:28:58.385Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:28:58.3858605Z ################################################################################ 2025-05-07T20:28:58.3858837Z 2025-05-07T20:28:58.3904170Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:28:58.3920861Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:28:58.3921784Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:28:58.3929687Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:28:58.3939103Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:28:58.3960254Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.7333667Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
2025-05-07T20:29:05.7334973Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:05.7335628Z 2025-05-07T20:29:05.7335868Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:05.7336299Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:05.7337134Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:05.7338396Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:05.7339499Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 63.3 MB/s eta 0:00:00 2025-05-07T20:29:05.7339910Z Installing collected packages: pytorch-triton 2025-05-07T20:29:05.7340268Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:05.7340678Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:05.7341110Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:05.7341550Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:05.7342021Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:05.7342290Z 2025-05-07T20:29:07.9482060Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:07.9486177Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:10.0996021Z ################################################################################ 2025-05-07T20:29:10.0996504Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:10.0996897Z ################################################################################ 2025-05-07T20:29:10.0997464Z 2025-05-07T20:29:12.1614151Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:14.3351615Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:14.3355306Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:14.3401039Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.3401542Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:14.3413631Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:14.3414026Z env: 2025-05-07T20:29:14.3414268Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:14.3414583Z BUILD_ENV: build_binary 2025-05-07T20:29:14.3414846Z BUILD_TARGET: genai 2025-05-07T20:29:14.3415088Z BUILD_VARIANT: cuda 2025-05-07T20:29:14.3415332Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:14.3415606Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:14.3415927Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:14.3416278Z ##[endgroup] 2025-05-07T20:29:14.6792281Z ################################################################################ 2025-05-07T20:29:14.6792823Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:14.6793131Z # 2025-05-07T20:29:14.6809413Z # [2025-05-07T20:29:14.680Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6810346Z ################################################################################ 2025-05-07T20:29:14.6810602Z 2025-05-07T20:29:14.6810984Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6811717Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6812073Z 2025-05-07T20:29:14.6961526Z 891428e398d8fa44bdcd60728272fd376b27a8ba fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6963838Z 2025-05-07T20:29:14.6964467Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.6964999Z 2025-05-07T20:29:14.7129759Z 86a533cac2dc47ba6525697cbaf3fe89eda98f1fc3bd69dfc08261cb1f2d2035 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7131851Z 2025-05-07T20:29:14.7132490Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7132983Z 2025-05-07T20:29:14.7459796Z 4c3714dae593cf99d3df6aac70dd67cf fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:14.7461869Z 2025-05-07T20:29:14.7473751Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:14.7495072Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.5397790Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:17.5399007Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:17.5399880Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:17.5400432Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:17.5400708Z 2025-05-07T20:29:24.4458494Z ################################################################################ 2025-05-07T20:29:24.4458878Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:24.4459277Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:24.4459714Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:29:24.4460069Z [CHECK] 2025-05-07T20:29:24.4460433Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:29:24.4461020Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:29:24.4461416Z ################################################################################ 2025-05-07T20:29:24.4462062Z 2025-05-07T20:29:24.4462186Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:29:28.4268865Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:29:32.3777605Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:36.3468049Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:29:36.3473562Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:29:48.2166473Z ################################################################################ 2025-05-07T20:29:48.2166937Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:29:48.2167287Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:29:48.2167649Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7 2025-05-07T20:29:48.2167998Z ################################################################################ 2025-05-07T20:29:48.2168220Z 2025-05-07T20:29:56.1583244Z ################################################################################ 2025-05-07T20:29:56.1584155Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:29:56.1586185Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:29:56.1587813Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:29:56.1588358Z ################################################################################ 2025-05-07T20:29:56.1588586Z 2025-05-07T20:29:56.1588746Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:30:00.1317890Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:30:04.1126985Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:30:08.2003793Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:30:12.1791898Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:30:12.1795645Z [INSTALL] Check for operator registrations ... 
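The registration checks below print each operator's qualified name and confirm it resolves on the torch.ops namespace. The exact probe used by setup_env.bash is not shown in the log; a minimal sketch of an equivalent check:

    # Importing fbgemm_gpu loads the compiled libraries and registers the ops;
    # an unregistered operator raises on attribute lookup instead of printing.
    conda run -n build_binary python -c "import fbgemm_gpu, torch; print(torch.ops.fbgemm.nccl_init)"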
2025-05-07T20:30:12.1795645Z [INSTALL] Check for operator registrations ...
2025-05-07T20:30:16.0736516Z fbgemm.nccl_init
2025-05-07T20:30:16.1350976Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init
2025-05-07T20:30:20.0208421Z fbgemm.gqa_attn_splitk
2025-05-07T20:30:20.0823430Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk
2025-05-07T20:30:23.9656904Z fbgemm.rope_qkv_decoding
2025-05-07T20:30:24.0266998Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding
2025-05-07T20:30:24.0268164Z [INSTALL] FBGEMM-GPU installation through wheel completed ...
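[NOTE] The operator-registration probes above can be approximated by resolving each op on the torch.ops.fbgemm namespace, which raises when nothing has registered it. A sketch, assuming the wheel is installed so that importing fbgemm_gpu loads the native libraries:

    import torch
    import fbgemm_gpu  # noqa: F401  (importing loads the .so files that register the ops)

    for op_name in ("nccl_init", "gqa_attn_splitk", "rope_qkv_decoding"):
        try:
            getattr(torch.ops.fbgemm, op_name)  # resolves the op or raises
        except (AttributeError, RuntimeError) as err:
            raise SystemExit(f"operator fbgemm.{op_name} is NOT registered: {err}")
        print(f"[CHECK] operator appears to be registered: torch.ops.fbgemm.{op_name}")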
2025-05-07T20:30:24.0304461Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:30:24.0304948Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV
2025-05-07T20:30:24.0318069Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:30:24.0318440Z env:
2025-05-07T20:30:24.0318683Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:30:24.0319000Z BUILD_ENV: build_binary
2025-05-07T20:30:24.0319263Z BUILD_TARGET: genai
2025-05-07T20:30:24.0319506Z BUILD_VARIANT: cuda
2025-05-07T20:30:24.0319752Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:30:24.0320024Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:30:24.0320439Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:30:24.0320787Z ##[endgroup]
2025-05-07T20:30:24.3674540Z ################################################################################
2025-05-07T20:30:24.3674962Z # Test All FBGEMM-GPU Modules
2025-05-07T20:30:24.3675591Z #
2025-05-07T20:30:24.3689431Z # [2025-05-07T20:30:24.368Z] + test_all_fbgemm_gpu_modules build_binary
2025-05-07T20:30:24.3690012Z ################################################################################
2025-05-07T20:30:24.3690310Z
2025-05-07T20:30:32.2805202Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda)
2025-05-07T20:30:32.2805804Z [TEST] Will be running tests specific to this target and variant ...
2025-05-07T20:30:32.2806230Z [TEST] Determined the test directories:
2025-05-07T20:30:32.2806557Z fbgemm_gpu/experimental/gen_ai/test
2025-05-07T20:30:32.2806896Z fbgemm_gpu/experimental/example/test
2025-05-07T20:30:32.2815632Z fbgemm_gpu/experimental/gemm/test
2025-05-07T20:30:32.2815846Z
2025-05-07T20:30:32.2816664Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ...
2025-05-07T20:30:32.2822154Z [TEST] Set environment variables for CUDA testing ...
2025-05-07T20:30:32.2822623Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES
2025-05-07T20:30:32.7039667Z
2025-05-07T20:30:32.7040122Z [TEST] Installing PyTest ...
2025-05-07T20:30:32.7064548Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:30:33.9603361Z Channels:
2025-05-07T20:30:33.9603718Z - conda-forge
2025-05-07T20:30:33.9603966Z Platform: linux-64
2025-05-07T20:30:37.2477978Z Collecting package metadata (repodata.json): done
2025-05-07T20:30:38.3847292Z Solving environment: done
2025-05-07T20:30:38.6153453Z
2025-05-07T20:30:38.6153894Z ## Package Plan ##
2025-05-07T20:30:38.6154121Z
2025-05-07T20:30:38.6154421Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:30:38.6154830Z
2025-05-07T20:30:38.6154954Z added / updated specs:
2025-05-07T20:30:38.6155288Z - expecttest
2025-05-07T20:30:38.6155578Z - pytest
2025-05-07T20:30:38.6155727Z
2025-05-07T20:30:38.6155885Z The following packages will be downloaded:
2025-05-07T20:30:38.6156118Z
2025-05-07T20:30:38.6156240Z package | build
2025-05-07T20:30:38.6156570Z ---------------------------|-----------------
2025-05-07T20:30:38.6156963Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge
2025-05-07T20:30:38.6157434Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge
2025-05-07T20:30:38.6157917Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge
2025-05-07T20:30:38.6158374Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge
2025-05-07T20:30:38.6158826Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge
2025-05-07T20:30:38.6159259Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge
2025-05-07T20:30:38.6159689Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge
2025-05-07T20:30:38.6160630Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge
2025-05-07T20:30:38.6161044Z ------------------------------------------------------------
2025-05-07T20:30:38.6161397Z Total: 428 KB
2025-05-07T20:30:38.6161619Z
2025-05-07T20:30:38.6161751Z The following NEW packages will be INSTALLED:
2025-05-07T20:30:38.6161976Z
2025-05-07T20:30:38.6162204Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:30:38.6162733Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:30:38.6163269Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:30:38.6163766Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:30:38.6164249Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:30:38.6164705Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:30:38.6165332Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:30:38.6165768Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:30:38.6166033Z
2025-05-07T20:30:38.6166198Z Downloading and Extracting Packages: ...working... done
[... per-package terminal progress bars elided: pytest-8.3.5, packaging-25.0, colorama-0.4.6, pluggy-1.5.0, exceptiongroup-1.2.2, tomli-2.2.1, expecttest-0.3.0, and iniconfig-2.0.0 all reached 100% ...]
2025-05-07T20:30:38.9718339Z Preparing transaction: done
2025-05-07T20:30:39.0721939Z Verifying transaction: done
2025-05-07T20:30:40.9749151Z Executing transaction: done
2025-05-07T20:30:41.1011128Z [TEST] Checking imports ...
2025-05-07T20:30:45.0304518Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
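[NOTE] The [EXEC] [ATTEMPT 0/3] prefix that appears before the pip and conda commands above suggests a bounded retry wrapper around flaky network operations. A sketch of that pattern; exec_with_retries is a hypothetical stand-in for whatever setup_env.bash actually defines:

    import subprocess
    import time

    def exec_with_retries(cmd: list, max_attempts: int = 3, delay_s: float = 10.0) -> None:
        # Re-run a command up to max_attempts times, sleeping between failures.
        for attempt in range(max_attempts):
            print(f"[EXEC] [ATTEMPT {attempt}/{max_attempts}] + {' '.join(cmd)}")
            if subprocess.run(cmd).returncode == 0:
                return
            time.sleep(delay_s)
        raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")

    exec_with_retries(["conda", "install", "-n", "build_binary", "-c", "conda-forge",
                       "--override-channels", "-y", "pytest", "expecttest"])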
2025-05-07T20:30:45.0318503Z [TEST] Setting feature flags ...
2025-05-07T20:30:45.0319023Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1
2025-05-07T20:30:45.4547020Z
2025-05-07T20:30:45.4547556Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning
2025-05-07T20:30:45.4548714Z ################################################################################
2025-05-07T20:30:45.4549039Z # Run FBGEMM-GPU Tests:
2025-05-07T20:30:45.4549310Z #
2025-05-07T20:30:45.4568600Z # [2025-05-07T20:30:45.456Z] + __run_fbgemm_gpu_tests_in_directory build_binary
2025-05-07T20:30:45.4569030Z ################################################################################
2025-05-07T20:30:45.4569263Z
2025-05-07T20:30:45.4576793Z [TEST] Enumerating ALL test files ...
2025-05-07T20:30:45.4605892Z ./attention/gqa_test.py
2025-05-07T20:30:45.4606185Z ./coalesce/coalesce_test.py
2025-05-07T20:30:45.4606458Z ./comm/multi_gpu_car_test.py
2025-05-07T20:30:45.4606750Z ./gather_scatter/gather_scatter_test.py
2025-05-07T20:30:45.4607060Z ./kv_cache/kv_cache_test.py
2025-05-07T20:30:45.4607323Z ./moe/activation_test.py
2025-05-07T20:30:45.4607584Z ./moe/gather_scatter_test.py
2025-05-07T20:30:45.4607847Z ./moe/layers_test.py
2025-05-07T20:30:45.4608083Z ./moe/shuffling_test.py
2025-05-07T20:30:45.4608340Z ./quantize/quantize_test.py
2025-05-07T20:30:45.4608516Z
2025-05-07T20:30:45.4608637Z [TEST] Enumerating IGNORED test files ...
2025-05-07T20:30:45.4608865Z
2025-05-07T20:30:45.4626538Z ################################################################################
2025-05-07T20:30:45.4641612Z # [2025-05-07T20:30:45.463Z] Run Python Test Suite:
2025-05-07T20:30:45.4641956Z # ./attention/gqa_test.py
2025-05-07T20:30:45.4642245Z ################################################################################
2025-05-07T20:30:45.4665584Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py
2025-05-07T20:30:45.4666198Z
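[NOTE] What follows is one pytest session per enumerated test file, each launched inside the build_binary conda environment with the shared argument list above. A rough reconstruction of that loop; run_test_suites is a hypothetical stand-in for __run_fbgemm_gpu_tests_in_directory:

    import pathlib
    import subprocess

    PYTEST_ARGS = ["-v", "-rsx", "-s",
                   "-W", "ignore::pytest.PytestCollectionWarning", "--cache-clear"]

    def run_test_suites(env: str, root: str = ".") -> None:
        # Enumerate every *_test.py under the test directory and run each
        # file as its own pytest session, as the per-suite banners below show.
        for test_file in sorted(pathlib.Path(root).rglob("*_test.py")):
            cmd = ["conda", "run", "--no-capture-output", "-n", env,
                   "python", "-m", "pytest", *PYTEST_ARGS, str(test_file)]
            subprocess.run(cmd, check=True)

    run_test_suites("build_binary")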
2025-05-07T20:30:48.0132787Z ============================= test session starts ==============================
2025-05-07T20:30:48.0133565Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:30:48.0134172Z cachedir: .pytest_cache
2025-05-07T20:30:48.0135371Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:30:48.0137388Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:30:48.0138242Z plugins: hypothesis-6.131.14
2025-05-07T20:30:49.6283238Z collecting ... collected 2 items
2025-05-07T20:30:49.6283686Z
2025-05-07T20:31:27.8591218Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa(int4_kv=False, num_groups=1, B=1, MAX_T=4, N_H_L=1)
[... several dozen further Hypothesis-generated examples elided: int4_kv in {False, True}, num_groups in {1, 4}, B and MAX_T up to 117, N_H_L up to 126; the self= object reprs were stripped by the log capture ...]
2025-05-07T20:31:27.8689967Z PASSED
2025-05-07T20:31:27.8774795Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...)
2025-05-07T20:31:27.8775138Z
2025-05-07T20:31:27.8775297Z =========================== short test summary info ============================
2025-05-07T20:31:27.8776425Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available
2025-05-07T20:31:27.8777887Z ======================== 1 passed, 1 skipped in 40.38s =========================
2025-05-07T20:31:28.5397795Z
2025-05-07T20:31:28.5399834Z [TEST] Python test suite PASSED: ./attention/gqa_test.py
2025-05-07T20:31:28.5419751Z [TEST] Python test time for ./attention/gqa_test.py: 43 seconds
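[NOTE] The session header above reports hypothesis profile 'ci' with database=None, deadline=None, print_blob=True, derandomize=True, and suppress_health_check=(HealthCheck.too_slow,). A profile like that is registered through Hypothesis' public settings API; where FBGEMM registers it (likely a conftest.py) is not visible in this log:

    from hypothesis import HealthCheck, settings

    # Deterministic CI profile matching the one reported by pytest above.
    settings.register_profile(
        "ci",
        database=None,                                  # do not persist failing examples
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print reproduction blobs on failure
        derandomize=True,                               # stable example order across runs
        suppress_health_check=(HealthCheck.too_slow,),  # allow slow GPU examples
    )
    settings.load_profile("ci")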
2025-05-07T20:31:28.5420050Z
2025-05-07T20:31:28.5443519Z ################################################################################
2025-05-07T20:31:28.5459320Z # [2025-05-07T20:31:28.545Z] Run Python Test Suite:
2025-05-07T20:31:28.5459675Z # ./coalesce/coalesce_test.py
2025-05-07T20:31:28.5459980Z ################################################################################
2025-05-07T20:31:28.5483889Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py
2025-05-07T20:31:28.5484538Z
2025-05-07T20:31:30.7038570Z ============================= test session starts ==============================
2025-05-07T20:31:30.7041213Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:30.7041762Z cachedir: .pytest_cache
2025-05-07T20:31:30.7042377Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:30.7043147Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:30.7043583Z plugins: hypothesis-6.131.14
2025-05-07T20:31:32.2442593Z collecting ... collected 1 item
2025-05-07T20:31:32.2442872Z
2025-05-07T20:31:32.9926914Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED
2025-05-07T20:31:32.9927261Z
2025-05-07T20:31:32.9927438Z ============================== 1 passed in 2.42s ===============================
2025-05-07T20:31:33.6353613Z
2025-05-07T20:31:33.6354138Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py
2025-05-07T20:31:33.6374400Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds
2025-05-07T20:31:33.6374841Z
2025-05-07T20:31:33.6395209Z ################################################################################
2025-05-07T20:31:33.6410863Z # [2025-05-07T20:31:33.640Z] Run Python Test Suite:
2025-05-07T20:31:33.6411331Z # ./comm/multi_gpu_car_test.py
2025-05-07T20:31:33.6411638Z ################################################################################
2025-05-07T20:31:33.6437251Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py
2025-05-07T20:31:33.6437887Z
2025-05-07T20:31:35.7987867Z ============================= test session starts ==============================
2025-05-07T20:31:35.7989405Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:35.7990519Z cachedir: .pytest_cache
2025-05-07T20:31:35.7991739Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:35.7993239Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:35.7994084Z plugins: hypothesis-6.131.14
2025-05-07T20:31:37.4198413Z collecting ... collected 5 items
2025-05-07T20:31:37.4198722Z
2025-05-07T20:31:37.4209035Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED
2025-05-07T20:31:37.4216921Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED
2025-05-07T20:31:37.4223607Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED
2025-05-07T20:31:37.4234542Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED
2025-05-07T20:31:37.4249528Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED
2025-05-07T20:31:37.4250033Z
2025-05-07T20:31:37.4250648Z =========================== short test summary info ============================
2025-05-07T20:31:37.4251386Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4252363Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4253334Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4254303Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4255267Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs
2025-05-07T20:31:37.4256108Z ============================== 5 skipped in 1.76s ==============================
2025-05-07T20:31:38.0114182Z
2025-05-07T20:31:38.0114853Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py
2025-05-07T20:31:38.0134405Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds
2025-05-07T20:31:38.0134809Z
2025-05-07T20:31:38.0155236Z ################################################################################
2025-05-07T20:31:38.0172190Z # [2025-05-07T20:31:38.016Z] Run Python Test Suite:
2025-05-07T20:31:38.0172548Z # ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:38.0172871Z ################################################################################
2025-05-07T20:31:38.0198636Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:38.0199341Z
2025-05-07T20:31:40.1757287Z ============================= test session starts ==============================
2025-05-07T20:31:40.1757955Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:40.1758508Z cachedir: .pytest_cache
2025-05-07T20:31:40.1759124Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:40.1759890Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:40.1760433Z plugins: hypothesis-6.131.14
2025-05-07T20:31:41.8168659Z collecting ... collected 2 items
2025-05-07T20:31:41.8169109Z
2025-05-07T20:31:41.8178294Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED
2025-05-07T20:31:41.8192731Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED
2025-05-07T20:31:41.8193385Z
2025-05-07T20:31:41.8193557Z =========================== short test summary info ============================
2025-05-07T20:31:41.8194210Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:41.8195082Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU.
2025-05-07T20:31:41.8195709Z ============================== 2 skipped in 1.78s ==============================
2025-05-07T20:31:42.4196060Z
2025-05-07T20:31:42.4196797Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py
2025-05-07T20:31:42.4217719Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds
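[NOTE] The skip reasons above gate tests on GPU topology and generation ("at least two GPUs", "only for Hopper GPU"). Conditions like these are commonly expressed as unittest.skipIf decorators; the exact decorators used by the test files are not shown in the log, so the following is a sketch:

    import unittest
    import torch

    def has_hopper_gpu() -> bool:
        # Hopper-class GPUs (e.g. H100) report compute capability (9, 0).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

    multi_gpu_only = unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.device_count() < 2,
        "Skip when CUDA is not available or when there are not enough GPUs; "
        "these tests require at least two GPUs",
    )

    hopper_only = unittest.skipIf(
        not has_hopper_gpu(),
        "Skip when no Hopper GPU is available. This test is only for Hopper GPU.",
    )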
2025-05-07T20:31:42.4218062Z
2025-05-07T20:31:42.4238476Z ################################################################################
2025-05-07T20:31:42.4254131Z # [2025-05-07T20:31:42.425Z] Run Python Test Suite:
2025-05-07T20:31:42.4254825Z # ./kv_cache/kv_cache_test.py
2025-05-07T20:31:42.4255133Z ################################################################################
2025-05-07T20:31:42.4280388Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py
2025-05-07T20:31:42.4281027Z
2025-05-07T20:31:44.5816832Z ============================= test session starts ==============================
2025-05-07T20:31:44.5817491Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:44.5818032Z cachedir: .pytest_cache
2025-05-07T20:31:44.5818639Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:44.5819402Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:44.5820196Z plugins: hypothesis-6.131.14
2025-05-07T20:31:46.1636909Z collecting ... collected 4 items
2025-05-07T20:31:46.1637236Z
2025-05-07T20:31:48.4147114Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...)
2025-05-07T20:31:48.4228966Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED
2025-05-07T20:31:48.4318558Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED
2025-05-07T20:31:48.4405677Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED
2025-05-07T20:31:48.4406048Z
2025-05-07T20:31:48.4406211Z =========================== short test summary info ============================
2025-05-07T20:31:48.4406935Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available
2025-05-07T20:31:48.4407892Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available
2025-05-07T20:31:48.4408566Z ============================== 4 skipped in 3.99s ==============================
2025-05-07T20:31:50.7160008Z
2025-05-07T20:31:50.7160994Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py
2025-05-07T20:31:50.7181226Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 8 seconds
2025-05-07T20:31:50.7181628Z
2025-05-07T20:31:50.7203646Z ################################################################################
2025-05-07T20:31:50.7223317Z # [2025-05-07T20:31:50.722Z] Run Python Test Suite:
2025-05-07T20:31:50.7223735Z # ./moe/activation_test.py
2025-05-07T20:31:50.7224058Z ################################################################################
2025-05-07T20:31:50.7248171Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py
2025-05-07T20:31:50.7248827Z
2025-05-07T20:31:52.8772534Z ============================= test session starts ==============================
2025-05-07T20:31:52.8773330Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python
2025-05-07T20:31:52.8773870Z cachedir: .pytest_cache
2025-05-07T20:31:52.8774486Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,)
2025-05-07T20:31:52.8775254Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu
2025-05-07T20:31:52.8775690Z plugins: hypothesis-6.131.14
2025-05-07T20:31:54.4646121Z TMA benchmarks will be running with experimental grid constant TMA descriptor.
2025-05-07T20:31:54.5615099Z collecting ... collected 2 items
collected 2 items 2025-05-07T20:31:54.5615417Z 2025-05-07T20:31:59.4786600Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:31:59.4787376Z self=, 2025-05-07T20:31:59.4788233Z T=1, 2025-05-07T20:31:59.4788528Z D=5120, 2025-05-07T20:31:59.4788828Z contiguous=True, 2025-05-07T20:31:59.4789088Z compiled=True, 2025-05-07T20:31:59.4789374Z ) 2025-05-07T20:31:59.4789676Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4790263Z self=, 2025-05-07T20:31:59.4790809Z T=4096, 2025-05-07T20:31:59.4791086Z D=5120, 2025-05-07T20:31:59.4791296Z contiguous=True, 2025-05-07T20:31:59.4791524Z compiled=True, 2025-05-07T20:31:59.4791738Z ) 2025-05-07T20:31:59.4791943Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4792328Z self=, 2025-05-07T20:31:59.4792723Z T=4096, 2025-05-07T20:31:59.4792918Z D=7168, 2025-05-07T20:31:59.4793128Z contiguous=False, 2025-05-07T20:31:59.4793359Z compiled=False, 2025-05-07T20:31:59.4793797Z ) 2025-05-07T20:31:59.4794100Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4794680Z self=, 2025-05-07T20:31:59.4795237Z T=4096, 2025-05-07T20:31:59.4795504Z D=5120, 2025-05-07T20:31:59.4795704Z contiguous=False, 2025-05-07T20:31:59.4795941Z compiled=True, 2025-05-07T20:31:59.4796154Z ) 2025-05-07T20:31:59.4796355Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4796748Z self=, 2025-05-07T20:31:59.4797142Z T=1, 2025-05-07T20:31:59.4797327Z D=7168, 2025-05-07T20:31:59.4797531Z contiguous=True, 2025-05-07T20:31:59.4797764Z compiled=True, 2025-05-07T20:31:59.4797971Z ) 2025-05-07T20:31:59.4798179Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4798571Z self=, 2025-05-07T20:31:59.4798957Z T=1, 2025-05-07T20:31:59.4799150Z D=7168, 2025-05-07T20:31:59.4799362Z contiguous=False, 2025-05-07T20:31:59.4799599Z compiled=True, 2025-05-07T20:31:59.4799814Z ) 2025-05-07T20:31:59.4800026Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4800510Z self=, 2025-05-07T20:31:59.4800899Z T=4096, 2025-05-07T20:31:59.4801132Z D=5120, 2025-05-07T20:31:59.4801338Z contiguous=False, 2025-05-07T20:31:59.4801580Z compiled=False, 2025-05-07T20:31:59.4801790Z ) 2025-05-07T20:31:59.4802000Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4802393Z self=, 2025-05-07T20:31:59.4802786Z T=1, 2025-05-07T20:31:59.4802973Z D=7168, 2025-05-07T20:31:59.4803178Z contiguous=True, 2025-05-07T20:31:59.4803420Z compiled=False, 2025-05-07T20:31:59.4803633Z ) 2025-05-07T20:31:59.4803839Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4804233Z self=, 2025-05-07T20:31:59.4804627Z T=2048, 2025-05-07T20:31:59.4804823Z D=5120, 2025-05-07T20:31:59.4805044Z contiguous=True, 2025-05-07T20:31:59.4805304Z compiled=True, 2025-05-07T20:31:59.4805517Z ) 2025-05-07T20:31:59.4805722Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4806106Z self=, 2025-05-07T20:31:59.4806501Z T=2048, 2025-05-07T20:31:59.4806697Z D=7168, 2025-05-07T20:31:59.4806894Z contiguous=True, 2025-05-07T20:31:59.4807127Z compiled=True, 2025-05-07T20:31:59.4807341Z ) 2025-05-07T20:31:59.4807541Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4807932Z self=, 2025-05-07T20:31:59.4808326Z T=2048, 2025-05-07T20:31:59.4808521Z D=7168, 2025-05-07T20:31:59.4808718Z contiguous=True, 2025-05-07T20:31:59.4808952Z compiled=False, 2025-05-07T20:31:59.4809164Z ) 2025-05-07T20:31:59.4809364Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4809762Z self=, 2025-05-07T20:31:59.4810281Z T=128, 2025-05-07T20:31:59.4810473Z D=5120, 2025-05-07T20:31:59.4810675Z contiguous=False, 2025-05-07T20:31:59.4810912Z 
compiled=True, 2025-05-07T20:31:59.4811119Z ) 2025-05-07T20:31:59.4811327Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4811722Z self=, 2025-05-07T20:31:59.4812109Z T=128, 2025-05-07T20:31:59.4812305Z D=5120, 2025-05-07T20:31:59.4812511Z contiguous=True, 2025-05-07T20:31:59.4812739Z compiled=True, 2025-05-07T20:31:59.4812952Z ) 2025-05-07T20:31:59.4813161Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4813949Z self=, 2025-05-07T20:31:59.4814393Z T=16384, 2025-05-07T20:31:59.4814598Z D=5120, 2025-05-07T20:31:59.4814798Z contiguous=False, 2025-05-07T20:31:59.4815036Z compiled=True, 2025-05-07T20:31:59.4815417Z ) 2025-05-07T20:31:59.4815618Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4816019Z self=, 2025-05-07T20:31:59.4816420Z T=16384, 2025-05-07T20:31:59.4816622Z D=5120, 2025-05-07T20:31:59.4816823Z contiguous=False, 2025-05-07T20:31:59.4817060Z compiled=False, 2025-05-07T20:31:59.4817278Z ) 2025-05-07T20:31:59.4817479Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4817871Z self=, 2025-05-07T20:31:59.4818266Z T=128, 2025-05-07T20:31:59.4818455Z D=7168, 2025-05-07T20:31:59.4818662Z contiguous=True, 2025-05-07T20:31:59.4818899Z compiled=False, 2025-05-07T20:31:59.4819114Z ) 2025-05-07T20:31:59.4819314Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4819706Z self=, 2025-05-07T20:31:59.4820100Z T=128, 2025-05-07T20:31:59.4820289Z D=7168, 2025-05-07T20:31:59.4820500Z contiguous=False, 2025-05-07T20:31:59.4820736Z compiled=False, 2025-05-07T20:31:59.4820948Z ) 2025-05-07T20:31:59.4821153Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4821540Z self=, 2025-05-07T20:31:59.4821926Z T=1, 2025-05-07T20:31:59.4822116Z D=5120, 2025-05-07T20:31:59.4822319Z contiguous=False, 2025-05-07T20:31:59.4822547Z compiled=False, 2025-05-07T20:31:59.4822759Z ) 2025-05-07T20:31:59.4822962Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4823345Z self=, 2025-05-07T20:31:59.4823737Z T=1, 2025-05-07T20:31:59.4823930Z D=7168, 2025-05-07T20:31:59.4824127Z contiguous=False, 2025-05-07T20:31:59.4824362Z compiled=False, 2025-05-07T20:31:59.4824576Z ) 2025-05-07T20:31:59.4824791Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4825210Z self=, 2025-05-07T20:31:59.4825632Z T=4096, 2025-05-07T20:31:59.4825828Z D=5120, 2025-05-07T20:31:59.4826038Z contiguous=True, 2025-05-07T20:31:59.4826270Z compiled=False, 2025-05-07T20:31:59.4826487Z ) 2025-05-07T20:31:59.4826744Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4827210Z self=, 2025-05-07T20:31:59.4827864Z T=128, 2025-05-07T20:31:59.4836020Z D=7168, 2025-05-07T20:31:59.4836295Z contiguous=True, 2025-05-07T20:31:59.4836546Z compiled=True, 2025-05-07T20:31:59.4836769Z ) 2025-05-07T20:31:59.4836973Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4837375Z self=, 2025-05-07T20:31:59.4837781Z T=1, 2025-05-07T20:31:59.4837974Z D=5120, 2025-05-07T20:31:59.4838188Z contiguous=False, 2025-05-07T20:31:59.4838427Z compiled=True, 2025-05-07T20:31:59.4838640Z ) 2025-05-07T20:31:59.4838850Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4839259Z self=, 2025-05-07T20:31:59.4839858Z T=4096, 2025-05-07T20:31:59.4840061Z D=7168, 2025-05-07T20:31:59.4840371Z contiguous=True, 2025-05-07T20:31:59.4840605Z compiled=False, 2025-05-07T20:31:59.4840827Z ) 2025-05-07T20:31:59.4841035Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4841420Z self=, 2025-05-07T20:31:59.4841816Z T=4096, 2025-05-07T20:31:59.4842012Z D=7168, 2025-05-07T20:31:59.4842213Z contiguous=False, 2025-05-07T20:31:59.4842452Z compiled=True, 2025-05-07T20:31:59.4842668Z ) 
2025-05-07T20:31:59.4842870Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4843260Z self=, 2025-05-07T20:31:59.4843656Z T=128, 2025-05-07T20:31:59.4843851Z D=5120, 2025-05-07T20:31:59.4844049Z contiguous=True, 2025-05-07T20:31:59.4844283Z compiled=False, 2025-05-07T20:31:59.4844594Z ) 2025-05-07T20:31:59.4844793Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4845191Z self=, 2025-05-07T20:31:59.4845585Z T=128, 2025-05-07T20:31:59.4845773Z D=5120, 2025-05-07T20:31:59.4845980Z contiguous=False, 2025-05-07T20:31:59.4846216Z compiled=False, 2025-05-07T20:31:59.4846423Z ) 2025-05-07T20:31:59.4846631Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4847022Z self=, 2025-05-07T20:31:59.4847407Z T=1, 2025-05-07T20:31:59.4847607Z D=5120, 2025-05-07T20:31:59.4847811Z contiguous=True, 2025-05-07T20:31:59.4848036Z compiled=False, 2025-05-07T20:31:59.4848252Z ) 2025-05-07T20:31:59.4848456Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4848840Z self=, 2025-05-07T20:31:59.4849238Z T=2048, 2025-05-07T20:31:59.4849432Z D=7168, 2025-05-07T20:31:59.4849647Z contiguous=False, 2025-05-07T20:31:59.4849879Z compiled=True, 2025-05-07T20:31:59.4850098Z ) 2025-05-07T20:31:59.4850308Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4850690Z self=, 2025-05-07T20:31:59.4851083Z T=2048, 2025-05-07T20:31:59.4851283Z D=7168, 2025-05-07T20:31:59.4851482Z contiguous=False, 2025-05-07T20:31:59.4851721Z compiled=False, 2025-05-07T20:31:59.4851941Z ) 2025-05-07T20:31:59.4852139Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4852531Z self=, 2025-05-07T20:31:59.4852930Z T=16384, 2025-05-07T20:31:59.4853126Z D=7168, 2025-05-07T20:31:59.4853332Z contiguous=False, 2025-05-07T20:31:59.4853575Z compiled=True, 2025-05-07T20:31:59.4853784Z ) 2025-05-07T20:31:59.4853996Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4854388Z self=, 2025-05-07T20:31:59.4854784Z T=16384, 2025-05-07T20:31:59.4854987Z D=7168, 2025-05-07T20:31:59.4855198Z contiguous=True, 2025-05-07T20:31:59.4855422Z compiled=True, 2025-05-07T20:31:59.4855636Z ) 2025-05-07T20:31:59.4855843Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4856232Z self=, 2025-05-07T20:31:59.4856618Z T=4096, 2025-05-07T20:31:59.4856813Z D=7168, 2025-05-07T20:31:59.4857016Z contiguous=True, 2025-05-07T20:31:59.4857242Z compiled=True, 2025-05-07T20:31:59.4857457Z ) 2025-05-07T20:31:59.4857666Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4858051Z self=, 2025-05-07T20:31:59.4858450Z T=2048, 2025-05-07T20:31:59.4858643Z D=5120, 2025-05-07T20:31:59.4858841Z contiguous=False, 2025-05-07T20:31:59.4859078Z compiled=False, 2025-05-07T20:31:59.4859293Z ) 2025-05-07T20:31:59.4859491Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4859982Z self=, 2025-05-07T20:31:59.4860383Z T=2048, 2025-05-07T20:31:59.4860575Z D=5120, 2025-05-07T20:31:59.4860776Z contiguous=True, 2025-05-07T20:31:59.4861011Z compiled=False, 2025-05-07T20:31:59.4861218Z ) 2025-05-07T20:31:59.4861426Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4861813Z self=, 2025-05-07T20:31:59.4862206Z T=128, 2025-05-07T20:31:59.4862396Z D=7168, 2025-05-07T20:31:59.4862602Z contiguous=False, 2025-05-07T20:31:59.4862842Z compiled=True, 2025-05-07T20:31:59.4863048Z ) 2025-05-07T20:31:59.4863254Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4863642Z self=, 2025-05-07T20:31:59.4864033Z T=16384, 2025-05-07T20:31:59.4864238Z D=5120, 2025-05-07T20:31:59.4864443Z contiguous=True, 2025-05-07T20:31:59.4864671Z compiled=True, 2025-05-07T20:31:59.4864972Z ) 2025-05-07T20:31:59.4865209Z Trying example: 
test_silu_mul( 2025-05-07T20:31:59.4865620Z self=, 2025-05-07T20:31:59.4866015Z T=2048, 2025-05-07T20:31:59.4866216Z D=5120, 2025-05-07T20:31:59.4866418Z contiguous=False, 2025-05-07T20:31:59.4866657Z compiled=True, 2025-05-07T20:31:59.4866870Z ) 2025-05-07T20:31:59.4867075Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4867468Z self=, 2025-05-07T20:31:59.4867864Z T=16384, 2025-05-07T20:31:59.4868060Z D=5120, 2025-05-07T20:31:59.4868267Z contiguous=True, 2025-05-07T20:31:59.4868501Z compiled=False, 2025-05-07T20:31:59.4868707Z ) 2025-05-07T20:31:59.4868917Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4869308Z self=, 2025-05-07T20:31:59.4869700Z T=16384, 2025-05-07T20:31:59.4869900Z D=7168, 2025-05-07T20:31:59.4870106Z contiguous=False, 2025-05-07T20:31:59.4870347Z compiled=False, 2025-05-07T20:31:59.4870555Z ) 2025-05-07T20:31:59.4870762Z Trying example: test_silu_mul( 2025-05-07T20:31:59.4871151Z self=, 2025-05-07T20:31:59.4871535Z T=16384, 2025-05-07T20:31:59.4871737Z D=7168, 2025-05-07T20:31:59.4871940Z contiguous=True, 2025-05-07T20:31:59.4872166Z compiled=False, 2025-05-07T20:31:59.4872378Z ) 2025-05-07T20:31:59.4872569Z PASSED 2025-05-07T20:31:59.5492607Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:31:59.5493887Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Traceback (most recent call last): 2025-05-07T20:31:59.5495312Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:31:59.5496846Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:31:59.5497881Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:31:59.5499247Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:31:59.5500693Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:31:59.5502411Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:31:59.5503858Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:31:59.5504960Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] module_map=module_map) 2025-05-07T20:31:59.5506284Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, 
in ast_to_ttir 2025-05-07T20:31:59.5507735Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] generator.visit(fn.parse()) 2025-05-07T20:31:59.5508619Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:59.5509881Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:31:59.5511146Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ret = super().visit(node) 2025-05-07T20:31:59.5512233Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:31:59.5513297Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] return visitor(node) 2025-05-07T20:31:59.5514959Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:31:59.5516306Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:31:59.5517243Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:31:59.5518379Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:31:59.5519467Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] self.visit(item) 2025-05-07T20:31:59.5520400Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:31:59.5521633Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:31:59.5523042Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:31:59.5524152Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:31:59.5525104Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:31:59.5526041Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ^ 2025-05-07T20:31:59.5527114Z W0507 20:31:59.546000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
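Note: every warning and failure in this job traces to a single root cause. Triton's fp8e4nv is the e4m3 format that NVIDIA GPUs expose natively only at compute capability 8.9 and newer (Ada/Hopper); the A10G behind linux.g5.4xlarge is sm_86, where Triton offers only fp8e4b15 and fp8e5, hence the repeated ValueError. A minimal sketch of a capability gate that would skip these cases on unsupported GPUs (the helper name and skip message are illustrative, not FBGEMM API):

import torch

def gpu_supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (e4m3) needs an NVIDIA GPU with compute
    # capability >= 8.9; the A10G on this runner reports (8, 6).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing cases below, e.g.:
#
#   @unittest.skipIf(not gpu_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
#   def test_silu_mul_quant(self, ...) -> None: ...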
2025-05-07T20:32:00.0091661Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.0092633Z self=,
2025-05-07T20:32:00.0093207Z T=1,
2025-05-07T20:32:00.0093407Z D=5120,
2025-05-07T20:32:00.0093625Z scale_ub=None,
2025-05-07T20:32:00.0093846Z contiguous=True,
2025-05-07T20:32:00.0094081Z compiled=True,
2025-05-07T20:32:00.0094298Z )
2025-05-07T20:32:00.0094630Z self = 
2025-05-07T20:32:00.0095139Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:00.0095408Z 
2025-05-07T20:32:00.0095497Z @given(
2025-05-07T20:32:00.0095734Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.0096065Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.0096389Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.0096731Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.0097079Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.0097380Z )
2025-05-07T20:32:00.0097751Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.0098219Z def test_silu_mul_quant(
2025-05-07T20:32:00.0098478Z     self,
2025-05-07T20:32:00.0098687Z     T: int,
2025-05-07T20:32:00.0098892Z     D: int,
2025-05-07T20:32:00.0099125Z     scale_ub: Optional[float],
2025-05-07T20:32:00.0099412Z     contiguous: bool,
2025-05-07T20:32:00.0099662Z     compiled: bool,
2025-05-07T20:32:00.0099900Z ) -> None:
2025-05-07T20:32:00.0100131Z     torch.manual_seed(2025)
2025-05-07T20:32:00.0100382Z 
2025-05-07T20:32:00.0100672Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.0101032Z 
2025-05-07T20:32:00.0101234Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.0101543Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.0101874Z     x = x_sign * x_clamp
2025-05-07T20:32:00.0102130Z     x0 = x[:, :D]
2025-05-07T20:32:00.0102355Z     x1 = x[:, D:]
2025-05-07T20:32:00.0102576Z 
2025-05-07T20:32:00.0102776Z     if contiguous:
2025-05-07T20:32:00.0103415Z         x0 = x0.contiguous()
2025-05-07T20:32:00.0103693Z         x1 = x1.contiguous()
2025-05-07T20:32:00.0103946Z 
2025-05-07T20:32:00.0104143Z     if scale_ub is not None:
2025-05-07T20:32:00.0104432Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.0104791Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.0105116Z         )
2025-05-07T20:32:00.0105344Z     else:
2025-05-07T20:32:00.0105593Z         scale_ub_tensor = None
2025-05-07T20:32:00.0105854Z 
2025-05-07T20:32:00.0106100Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.0106462Z         op = silu_mul_quant
2025-05-07T20:32:00.0106729Z         if compiled:
2025-05-07T20:32:00.0106994Z             op = torch.compile(op)
2025-05-07T20:32:00.0107303Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.0107598Z 
2025-05-07T20:32:00.0107989Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.0108292Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.0108601Z 
2025-05-07T20:32:00.0108851Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.0109198Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.0109510Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.0109843Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.0110222Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.0110544Z 
2025-05-07T20:32:00.0110759Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.0110964Z 
2025-05-07T20:32:00.0111076Z moe/activation_test.py:126:
2025-05-07T20:32:00.0111388Z _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0111747Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.0112096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.0112941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.0114069Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.0114653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0115378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0116099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.0116865Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.0117639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.0118325Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.0118964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.0119511Z fn() 2025-05-07T20:32:00.0120054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.0120821Z self.fn.run( 2025-05-07T20:32:00.0121362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0121925Z kernel = self.compile( 2025-05-07T20:32:00.0122497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0123183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0123606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0123849Z 2025-05-07T20:32:00.0124074Z self = 2025-05-07T20:32:00.0125365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0126833Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f17492f20>} 2025-05-07T20:32:00.0128243Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0129316Z context = 2025-05-07T20:32:00.0129618Z 2025-05-07T20:32:00.0129801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0130353Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0130966Z module_map=module_map) 2025-05-07T20:32:00.0131358Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0131736Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.0132016Z E ^ 2025-05-07T20:32:00.0132506Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0133000Z 2025-05-07T20:32:00.0133440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.0133986Z 2025-05-07T20:32:00.0134097Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.0134541Z self=, 2025-05-07T20:32:00.0134962Z T=2048, 2025-05-07T20:32:00.0135169Z D=5120, 2025-05-07T20:32:00.0135377Z scale_ub=1200.0, 2025-05-07T20:32:00.0135612Z contiguous=True, 2025-05-07T20:32:00.0135859Z compiled=False, 2025-05-07T20:32:00.0136077Z ) 2025-05-07T20:32:00.0144797Z self = 2025-05-07T20:32:00.0145444Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:00.0145770Z 2025-05-07T20:32:00.0145854Z @given( 2025-05-07T20:32:00.0146114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.0146471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.0146819Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.0147189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.0147564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.0147890Z ) 2025-05-07T20:32:00.0148291Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.0148819Z def test_silu_mul_quant( 2025-05-07T20:32:00.0149092Z self, 2025-05-07T20:32:00.0149305Z T: int, 2025-05-07T20:32:00.0149520Z D: int, 2025-05-07T20:32:00.0149762Z scale_ub: Optional[float], 2025-05-07T20:32:00.0150060Z contiguous: bool, 2025-05-07T20:32:00.0150329Z compiled: bool, 2025-05-07T20:32:00.0150577Z ) -> None: 2025-05-07T20:32:00.0150809Z torch.manual_seed(2025) 2025-05-07T20:32:00.0151082Z 2025-05-07T20:32:00.0151389Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.0151788Z 2025-05-07T20:32:00.0151985Z x_sign = torch.sign(x) 2025-05-07T20:32:00.0152292Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.0152615Z x = x_sign * x_clamp 2025-05-07T20:32:00.0152860Z x0 = x[:, :D] 2025-05-07T20:32:00.0153089Z x1 = x[:, D:] 2025-05-07T20:32:00.0153307Z 2025-05-07T20:32:00.0153495Z if contiguous: 2025-05-07T20:32:00.0153744Z x0 = x0.contiguous() 2025-05-07T20:32:00.0154018Z x1 = x1.contiguous() 2025-05-07T20:32:00.0154268Z 2025-05-07T20:32:00.0154589Z if scale_ub is not None: 2025-05-07T20:32:00.0154882Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.0155232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.0155608Z ) 2025-05-07T20:32:00.0155809Z else: 2025-05-07T20:32:00.0156024Z scale_ub_tensor = None 2025-05-07T20:32:00.0156287Z 2025-05-07T20:32:00.0156531Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.0156854Z op = silu_mul_quant 2025-05-07T20:32:00.0157115Z if compiled: 2025-05-07T20:32:00.0157373Z op = torch.compile(op) 2025-05-07T20:32:00.0157682Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0157962Z 2025-05-07T20:32:00.0158163Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.0158332Z 2025-05-07T20:32:00.0158441Z moe/activation_test.py:117: 2025-05-07T20:32:00.0158743Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0159187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.0159485Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.0160288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:00.0161013Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:00.0161577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.0162293Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.0162984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.0163542Z kernel = self.compile( 2025-05-07T20:32:00.0164114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.0164813Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.0165227Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.0165473Z 2025-05-07T20:32:00.0165688Z self = 2025-05-07T20:32:00.0166810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.0168239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f289bfec0>} 2025-05-07T20:32:00.0169626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.0170698Z context = 2025-05-07T20:32:00.0171006Z 2025-05-07T20:32:00.0171181Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.0171732Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.0172220Z module_map=module_map) 2025-05-07T20:32:00.0172610Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.0172981Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.0173246Z E ^ 2025-05-07T20:32:00.0173732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.0174203Z 2025-05-07T20:32:00.0174634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.2738915Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:00.2740066Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:00.2741625Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:00.2743236Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:00.2744256Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:00.2745814Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:00.2747253Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.2748613Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:00.2750047Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.2751149Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:00.2752474Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:00.2753770Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:00.2754651Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:00.2755913Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:00.2757180Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:00.2758275Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:00.2759347Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 2025-05-07T20:32:00.2760737Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:00.2762079Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:00.2763107Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:00.2764255Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:00.2765344Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:00.2766156Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:00.2767379Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:00.2768786Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:00.2769981Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.2770944Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:00.2771725Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:00.2772797Z W0507 20:32:00.269000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
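The same CompilationError reproduces outside FBGEMM with any @triton.jit kernel that casts to tl.float8e4nv on this GPU; a hypothetical minimal kernel (an assumption for illustration, not FBGEMM code) fails at the same src.make_ir stage:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # On pre-sm_89 GPUs this cast is rejected at compile time with
    # ValueError("type fp8e4nv not supported in this architecture. ...")
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

# x = torch.randn(4096, device="cuda", dtype=torch.bfloat16)
# y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
# _cast_to_fp8e4nv[(triton.cdiv(x.numel(), 1024),)](x, y, x.numel(), BLOCK=1024)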
2025-05-07T20:32:00.8650036Z 
2025-05-07T20:32:00.8650459Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:00.8651300Z self=,
2025-05-07T20:32:00.8651896Z T=2048,
2025-05-07T20:32:00.8652148Z D=5120,
2025-05-07T20:32:00.8652349Z scale_ub=1200.0,
2025-05-07T20:32:00.8652587Z contiguous=True,
2025-05-07T20:32:00.8652824Z compiled=True,
2025-05-07T20:32:00.8653043Z )
2025-05-07T20:32:00.8653389Z self = 
2025-05-07T20:32:00.8653915Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:00.8654198Z 
2025-05-07T20:32:00.8654284Z @given(
2025-05-07T20:32:00.8654533Z     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:00.8654892Z     D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:00.8655609Z     scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:00.8655959Z     contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:00.8656311Z     compiled=st.sampled_from([True, False]),
2025-05-07T20:32:00.8656613Z )
2025-05-07T20:32:00.8656976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:00.8657441Z def test_silu_mul_quant(
2025-05-07T20:32:00.8657695Z     self,
2025-05-07T20:32:00.8657896Z     T: int,
2025-05-07T20:32:00.8658105Z     D: int,
2025-05-07T20:32:00.8658334Z     scale_ub: Optional[float],
2025-05-07T20:32:00.8658614Z     contiguous: bool,
2025-05-07T20:32:00.8658869Z     compiled: bool,
2025-05-07T20:32:00.8659108Z ) -> None:
2025-05-07T20:32:00.8659329Z     torch.manual_seed(2025)
2025-05-07T20:32:00.8659584Z 
2025-05-07T20:32:00.8659875Z     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:00.8660387Z 
2025-05-07T20:32:00.8660593Z     x_sign = torch.sign(x)
2025-05-07T20:32:00.8660901Z     x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:00.8661229Z     x = x_sign * x_clamp
2025-05-07T20:32:00.8661477Z     x0 = x[:, :D]
2025-05-07T20:32:00.8661706Z     x1 = x[:, D:]
2025-05-07T20:32:00.8661931Z 
2025-05-07T20:32:00.8662122Z     if contiguous:
2025-05-07T20:32:00.8662369Z         x0 = x0.contiguous()
2025-05-07T20:32:00.8662642Z         x1 = x1.contiguous()
2025-05-07T20:32:00.8662889Z 
2025-05-07T20:32:00.8663097Z     if scale_ub is not None:
2025-05-07T20:32:00.8663388Z         scale_ub_tensor = torch.tensor(
2025-05-07T20:32:00.8663737Z             [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:00.8664067Z         )
2025-05-07T20:32:00.8664274Z     else:
2025-05-07T20:32:00.8664493Z         scale_ub_tensor = None
2025-05-07T20:32:00.8664771Z 
2025-05-07T20:32:00.8665023Z     def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.8665360Z         op = silu_mul_quant
2025-05-07T20:32:00.8665620Z         if compiled:
2025-05-07T20:32:00.8665915Z             op = torch.compile(op)
2025-05-07T20:32:00.8666230Z         return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:00.8666525Z 
2025-05-07T20:32:00.8666728Z     y_fp8, y_scale = fn()
2025-05-07T20:32:00.8667029Z     y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:00.8667340Z 
2025-05-07T20:32:00.8667594Z     def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:00.8667945Z         x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:00.8668257Z         x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:00.8668591Z         y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:00.8668969Z         return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:00.8669306Z 
2025-05-07T20:32:00.8669528Z >   y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:00.8669736Z 
2025-05-07T20:32:00.8669846Z moe/activation_test.py:126:
2025-05-07T20:32:00.8670165Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8670520Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:00.8670871Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:00.8671696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:00.8672485Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:00.8673063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:00.8673781Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:00.8674508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:00.8675363Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:00.8676187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:00.8676859Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:00.8677498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:00.8678050Z fn() 2025-05-07T20:32:00.8678589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:00.8679199Z self.fn.run( 2025-05-07T20:32:00.8679699Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:00.8680384Z kernel = self.compile( 2025-05-07T20:32:00.8680957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:00.8681741Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:00.8682164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8682408Z 2025-05-07T20:32:00.8682633Z self = 2025-05-07T20:32:00.8683762Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:00.8685214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f288f7240>} 2025-05-07T20:32:00.8686673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:00.8687752Z context = 2025-05-07T20:32:00.8688058Z 2025-05-07T20:32:00.8688244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:00.8688797Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:00.8689294Z module_map=module_map) 2025-05-07T20:32:00.8689687Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:00.8690061Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:00.8690348Z E ^ 2025-05-07T20:32:00.8690840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.8691310Z 2025-05-07T20:32:00.8691755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:00.8692298Z 2025-05-07T20:32:00.8692413Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:00.8692860Z self=, 2025-05-07T20:32:00.8693289Z T=16384, 2025-05-07T20:32:00.8693492Z D=7168, 2025-05-07T20:32:00.8693704Z scale_ub=1200.0, 2025-05-07T20:32:00.8693946Z contiguous=False, 2025-05-07T20:32:00.8694184Z compiled=False, 2025-05-07T20:32:00.8694402Z ) 2025-05-07T20:32:00.8694742Z self = 2025-05-07T20:32:00.8695281Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:00.8695582Z 2025-05-07T20:32:00.8695667Z @given( 2025-05-07T20:32:00.8695917Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:00.8696256Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:00.8696578Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:00.8696935Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:00.8697369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:00.8697676Z ) 2025-05-07T20:32:00.8698048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:00.8698517Z def test_silu_mul_quant( 2025-05-07T20:32:00.8698770Z self, 2025-05-07T20:32:00.8698979Z T: int, 2025-05-07T20:32:00.8699188Z D: int, 2025-05-07T20:32:00.8699419Z scale_ub: Optional[float], 2025-05-07T20:32:00.8699701Z contiguous: bool, 2025-05-07T20:32:00.8699959Z compiled: bool, 2025-05-07T20:32:00.8700197Z ) -> None: 2025-05-07T20:32:00.8700423Z torch.manual_seed(2025) 2025-05-07T20:32:00.8700685Z 2025-05-07T20:32:00.8700973Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:00.8701348Z 2025-05-07T20:32:00.8701552Z x_sign = torch.sign(x) 2025-05-07T20:32:00.8701948Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:00.8702283Z x = x_sign * x_clamp 2025-05-07T20:32:00.8702532Z x0 = x[:, :D] 2025-05-07T20:32:00.8702767Z x1 = x[:, D:] 2025-05-07T20:32:00.8702990Z 2025-05-07T20:32:00.8703186Z if contiguous: 2025-05-07T20:32:00.8703433Z x0 = x0.contiguous() 2025-05-07T20:32:00.8703708Z x1 = x1.contiguous() 2025-05-07T20:32:00.8703958Z 2025-05-07T20:32:00.8704162Z if scale_ub is not None: 2025-05-07T20:32:00.8704453Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:00.8704813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:00.8705136Z ) 2025-05-07T20:32:00.8705345Z else: 2025-05-07T20:32:00.8705605Z scale_ub_tensor = None 2025-05-07T20:32:00.8705875Z 2025-05-07T20:32:00.8706122Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:00.8706454Z op = silu_mul_quant 2025-05-07T20:32:00.8706718Z if compiled: 2025-05-07T20:32:00.8706986Z op = torch.compile(op) 2025-05-07T20:32:00.8707303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.8707588Z 2025-05-07T20:32:00.8707794Z > y_fp8, y_scale = fn() 2025-05-07T20:32:00.8707966Z 2025-05-07T20:32:00.8708077Z moe/activation_test.py:117: 2025-05-07T20:32:00.8708383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:00.8708736Z moe/activation_test.py:115: in fn 2025-05-07T20:32:00.8709037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:00.8709760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
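Every failure in this run reduces to the same root cause: Triton refuses to compile any kernel that touches the fp8e4nv (FP8 E4M3) dtype on this runner's GPU. The job runs on linux.g5.4xlarge.nvidia.gpu, whose NVIDIA A10G reports compute capability (8, 6), and the supported list printed above, ('fp8e4b15', 'fp8e5'), is consistent with Triton's NVIDIA backend enabling fp8e4nv only from compute capability (8, 9) (Ada/Hopper) upward. A minimal sketch of a capability guard that would let such tests skip cleanly on pre-Ada GPUs; supports_fp8e4nv and Fp8KernelTests are hypothetical names for illustration, not FBGEMM code:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # A10G (g5.4xlarge) reports (8, 6); the assumption here is that
        # Triton's fp8e4nv path needs compute capability (8, 9) or newer,
        # which matches the ValueError above on this runner.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
    class Fp8KernelTests(unittest.TestCase):
        def test_e4m3_roundtrip(self) -> None:
            # Cast through FP8 E4M3 and back; tolerances sized for a
            # 3-bit mantissa (relative step up to ~6.25%).
            x = torch.randn(8, device="cuda", dtype=torch.bfloat16)
            y = x.to(torch.float8_e4m3fn).to(torch.bfloat16)
            torch.testing.assert_close(x, y, rtol=0.13, atol=0.02)

The same predicate could gate the kernels themselves, falling back to fp8e5 (E5M2) where E4M3 is unavailable, at the cost of one mantissa bit of precision.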
2025-05-07T20:32:00.8710497Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:00.8711062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
2025-05-07T20:32:00.8711794Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:00.8712498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:00.8713067Z kernel = self.compile(
2025-05-07T20:32:00.8713913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:00.8714611Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:00.8715038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:00.8715280Z
2025-05-07T20:32:00.8715505Z self =
2025-05-07T20:32:00.8716635Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:00.8718251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164b6e80>}
2025-05-07T20:32:00.8719805Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:00.8729309Z context =
2025-05-07T20:32:00.8729663Z
2025-05-07T20:32:00.8729850Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:00.8730418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:00.8730922Z module_map=module_map)
2025-05-07T20:32:00.8731306Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:00.8731889Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:00.8732169Z E ^
2025-05-07T20:32:00.8732668Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:00.8733141Z 2025-05-07T20:32:00.8733583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.0487112Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:01.0488352Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last): 2025-05-07T20:32:01.0489760Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:01.0491306Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:01.0492331Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:01.0493705Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:01.0495155Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.0496537Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:01.0497988Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.0499088Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] module_map=module_map) 2025-05-07T20:32:01.0500408Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:01.0501714Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] generator.visit(fn.parse()) 2025-05-07T20:32:01.0502952Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.0504223Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:01.0505478Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ret = super().visit(node) 2025-05-07T20:32:01.0506561Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:01.0507630Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] return visitor(node) 2025-05-07T20:32:01.0508912Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:01.0510384Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:01.0511322Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:01.0512462Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:01.0513933Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] self.visit(item) 2025-05-07T20:32:01.0514757Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:01.0515989Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:01.0517392Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:01.0518498Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.0519452Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.0520342Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^ 2025-05-07T20:32:01.0521411Z W0507 20:32:01.045000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9487019Z 2025-05-07T20:32:01.9487394Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9487946Z self=, 2025-05-07T20:32:01.9488556Z T=1, 2025-05-07T20:32:01.9488829Z D=7168, 2025-05-07T20:32:01.9489106Z scale_ub=None, 2025-05-07T20:32:01.9489421Z contiguous=True, 2025-05-07T20:32:01.9490230Z compiled=True, 2025-05-07T20:32:01.9490506Z ) 2025-05-07T20:32:01.9490937Z self = 2025-05-07T20:32:01.9491453Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:01.9491727Z 2025-05-07T20:32:01.9491814Z @given( 2025-05-07T20:32:01.9492066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9492395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9492710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9493054Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9493399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9493692Z ) 2025-05-07T20:32:01.9494060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9494522Z def test_silu_mul_quant( 2025-05-07T20:32:01.9494775Z self, 2025-05-07T20:32:01.9494985Z T: int, 2025-05-07T20:32:01.9495192Z D: int, 2025-05-07T20:32:01.9495425Z scale_ub: Optional[float], 2025-05-07T20:32:01.9495704Z contiguous: bool, 2025-05-07T20:32:01.9495957Z compiled: bool, 2025-05-07T20:32:01.9496196Z ) -> None: 2025-05-07T20:32:01.9496419Z torch.manual_seed(2025) 2025-05-07T20:32:01.9496676Z 2025-05-07T20:32:01.9496962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9497317Z 2025-05-07T20:32:01.9497522Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9497827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9498147Z x = x_sign * x_clamp 2025-05-07T20:32:01.9498398Z x0 = x[:, :D] 2025-05-07T20:32:01.9498625Z x1 = x[:, D:] 2025-05-07T20:32:01.9498837Z 2025-05-07T20:32:01.9499035Z if contiguous: 2025-05-07T20:32:01.9499277Z x0 = x0.contiguous() 2025-05-07T20:32:01.9499543Z x1 = x1.contiguous() 2025-05-07T20:32:01.9499806Z 2025-05-07T20:32:01.9500008Z if scale_ub is not None: 2025-05-07T20:32:01.9500298Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9500645Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9500973Z ) 2025-05-07T20:32:01.9501177Z else: 2025-05-07T20:32:01.9501395Z scale_ub_tensor = None 2025-05-07T20:32:01.9501660Z 2025-05-07T20:32:01.9501908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9502233Z op = silu_mul_quant 2025-05-07T20:32:01.9502500Z if compiled: 2025-05-07T20:32:01.9502794Z op = torch.compile(op) 2025-05-07T20:32:01.9503109Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9503398Z 2025-05-07T20:32:01.9503598Z y_fp8, y_scale = fn() 2025-05-07T20:32:01.9503899Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:01.9504207Z 2025-05-07T20:32:01.9504456Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9504976Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:01.9505292Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:01.9505619Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:01.9505997Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9506325Z 2025-05-07T20:32:01.9506539Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:01.9506743Z 2025-05-07T20:32:01.9506849Z moe/activation_test.py:126: 2025-05-07T20:32:01.9507162Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9507515Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:01.9507852Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:01.9508680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:01.9509546Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:01.9510124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9510834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9511556Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:01.9512316Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:01.9513077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:01.9514195Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:01.9514830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:01.9515373Z fn() 2025-05-07T20:32:01.9515911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:01.9516526Z self.fn.run( 2025-05-07T20:32:01.9517017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9517566Z kernel = self.compile( 2025-05-07T20:32:01.9518133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9518816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9519230Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9519471Z 2025-05-07T20:32:01.9519689Z self = 2025-05-07T20:32:01.9520913Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9522363Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164b6700>} 2025-05-07T20:32:01.9523757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9524819Z context = 2025-05-07T20:32:01.9525121Z 2025-05-07T20:32:01.9525297Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9525851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9526346Z module_map=module_map) 2025-05-07T20:32:01.9526729Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9527253Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:01.9527545Z E ^ 2025-05-07T20:32:01.9528036Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9528506Z 2025-05-07T20:32:01.9528941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:01.9529482Z 2025-05-07T20:32:01.9529593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:01.9530031Z self=, 2025-05-07T20:32:01.9530456Z T=4096, 2025-05-07T20:32:01.9530652Z D=5120, 2025-05-07T20:32:01.9530858Z scale_ub=None, 2025-05-07T20:32:01.9531089Z contiguous=False, 2025-05-07T20:32:01.9531324Z compiled=False, 2025-05-07T20:32:01.9531544Z ) 2025-05-07T20:32:01.9531882Z self = 2025-05-07T20:32:01.9532518Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:01.9532812Z 2025-05-07T20:32:01.9532895Z @given( 2025-05-07T20:32:01.9533142Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:01.9533463Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:01.9533791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:01.9534141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:01.9534490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:01.9534788Z ) 2025-05-07T20:32:01.9535156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:01.9535619Z def test_silu_mul_quant( 2025-05-07T20:32:01.9535896Z self, 2025-05-07T20:32:01.9536128Z T: int, 2025-05-07T20:32:01.9536336Z D: int, 2025-05-07T20:32:01.9536567Z scale_ub: Optional[float], 2025-05-07T20:32:01.9536865Z contiguous: bool, 2025-05-07T20:32:01.9537119Z compiled: bool, 2025-05-07T20:32:01.9537353Z ) -> None: 2025-05-07T20:32:01.9537583Z torch.manual_seed(2025) 2025-05-07T20:32:01.9537837Z 2025-05-07T20:32:01.9538121Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:01.9538487Z 2025-05-07T20:32:01.9538693Z x_sign = torch.sign(x) 2025-05-07T20:32:01.9538997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:01.9539325Z x = x_sign * x_clamp 2025-05-07T20:32:01.9539575Z x0 = x[:, :D] 2025-05-07T20:32:01.9539802Z x1 = x[:, D:] 2025-05-07T20:32:01.9540016Z 2025-05-07T20:32:01.9540211Z if contiguous: 2025-05-07T20:32:01.9540453Z x0 = x0.contiguous() 2025-05-07T20:32:01.9540718Z x1 = x1.contiguous() 2025-05-07T20:32:01.9540968Z 2025-05-07T20:32:01.9541172Z if scale_ub is not None: 2025-05-07T20:32:01.9541458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:01.9541825Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:01.9542151Z ) 2025-05-07T20:32:01.9542351Z else: 2025-05-07T20:32:01.9542574Z scale_ub_tensor = None 2025-05-07T20:32:01.9542840Z 2025-05-07T20:32:01.9543080Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:01.9543413Z op = silu_mul_quant 2025-05-07T20:32:01.9543679Z if compiled: 2025-05-07T20:32:01.9543934Z op = torch.compile(op) 2025-05-07T20:32:01.9544246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9544539Z 2025-05-07T20:32:01.9544736Z > y_fp8, y_scale = fn() 2025-05-07T20:32:01.9544914Z 2025-05-07T20:32:01.9545018Z moe/activation_test.py:117: 2025-05-07T20:32:01.9545328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9545682Z moe/activation_test.py:115: in fn 2025-05-07T20:32:01.9545973Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:01.9546789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:01.9547517Z 
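The reference path in these examples pins down the contract of triton_quantize_fp8_row: the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None], so the returned scale must be the per-row dequantization multiplier. A plain-PyTorch sketch of that row-wise scheme, under the assumption that scale_ub caps the per-row max before the scale is derived; quantize_fp8_row_ref is an illustrative helper, not the FBGEMM kernel:

    from typing import Optional, Tuple

    import torch

    FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally capped by the scale upper bound.
        row_max = y.abs().amax(dim=1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        scale = row_max / FP8_E4M3_MAX             # per-row dequant multiplier
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Round trip: y_fp8.to(torch.float32) * scale[:, None] approximates y,
    # matching the dequantization used in test_silu_mul_quant.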
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:01.9548077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:01.9548983Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:01.9549680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:01.9550236Z kernel = self.compile( 2025-05-07T20:32:01.9550797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:01.9551492Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:01.9551909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:01.9552235Z 2025-05-07T20:32:01.9552461Z self = 2025-05-07T20:32:01.9553578Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:01.9555004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f28971d00>} 2025-05-07T20:32:01.9556397Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:01.9557460Z context = 2025-05-07T20:32:01.9557762Z 2025-05-07T20:32:01.9557943Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:01.9558499Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:01.9558992Z module_map=module_map) 2025-05-07T20:32:01.9559377Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:01.9559745Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:01.9560018Z E ^ 2025-05-07T20:32:01.9560580Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:01.9561048Z 2025-05-07T20:32:01.9561486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:02.2249637Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:02.2250989Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:02.2252409Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:02.2253923Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:02.2254959Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:02.2256667Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:02.2258144Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:02.2259517Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:02.2260974Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:02.2262085Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:02.2263584Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:02.2264904Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:02.2265797Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:02.2267071Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:02.2268348Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:02.2269454Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:02.2270545Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:02.2271834Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:02.2273190Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:02.2274150Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:02.2275309Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:02.2276415Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:02.2277232Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:02.2278472Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:02.2279904Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:02.2281127Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:02.2282173Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:02.2282966Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:02.2284050Z W0507 20:32:02.221000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8376398Z 2025-05-07T20:32:03.8377102Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8377876Z self=, 2025-05-07T20:32:03.8378564Z T=4096, 2025-05-07T20:32:03.8378926Z D=7168, 2025-05-07T20:32:03.8379239Z scale_ub=None, 2025-05-07T20:32:03.8379591Z contiguous=False, 2025-05-07T20:32:03.8379971Z compiled=False, 2025-05-07T20:32:03.8380307Z ) 2025-05-07T20:32:03.8380853Z self = 2025-05-07T20:32:03.8381713Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:03.8382187Z 2025-05-07T20:32:03.8382310Z @given( 2025-05-07T20:32:03.8382688Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8383180Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8383668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8384256Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8384826Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8385321Z ) 2025-05-07T20:32:03.8385901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8386710Z def test_silu_mul_quant( 2025-05-07T20:32:03.8387106Z self, 2025-05-07T20:32:03.8387414Z T: int, 2025-05-07T20:32:03.8387732Z D: int, 2025-05-07T20:32:03.8388093Z scale_ub: Optional[float], 2025-05-07T20:32:03.8388545Z contiguous: bool, 2025-05-07T20:32:03.8388950Z compiled: bool, 2025-05-07T20:32:03.8389320Z ) -> None: 2025-05-07T20:32:03.8389669Z torch.manual_seed(2025) 2025-05-07T20:32:03.8390079Z 2025-05-07T20:32:03.8390542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8391118Z 2025-05-07T20:32:03.8391436Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8391921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8392865Z x = x_sign * x_clamp 2025-05-07T20:32:03.8393273Z x0 = x[:, :D] 2025-05-07T20:32:03.8393630Z x1 = x[:, D:] 2025-05-07T20:32:03.8393975Z 2025-05-07T20:32:03.8394270Z if contiguous: 2025-05-07T20:32:03.8394648Z x0 = x0.contiguous() 2025-05-07T20:32:03.8395080Z x1 = x1.contiguous() 2025-05-07T20:32:03.8395472Z 2025-05-07T20:32:03.8395787Z if scale_ub is not None: 2025-05-07T20:32:03.8396256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8396811Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8397329Z ) 2025-05-07T20:32:03.8397641Z else: 2025-05-07T20:32:03.8397975Z scale_ub_tensor = None 2025-05-07T20:32:03.8398385Z 2025-05-07T20:32:03.8398783Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8399310Z op = silu_mul_quant 2025-05-07T20:32:03.8399953Z if compiled: 2025-05-07T20:32:03.8400522Z op = torch.compile(op) 2025-05-07T20:32:03.8401036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8401485Z 2025-05-07T20:32:03.8401805Z > y_fp8, y_scale = fn() 2025-05-07T20:32:03.8402054Z 2025-05-07T20:32:03.8402209Z moe/activation_test.py:117: 2025-05-07T20:32:03.8402642Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8403137Z moe/activation_test.py:115: in fn 2025-05-07T20:32:03.8403559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8404569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:03.8405592Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:03.8406389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:03.8407401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8408391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8409171Z kernel = self.compile( 2025-05-07T20:32:03.8409942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8410894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8411460Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8411797Z 2025-05-07T20:32:03.8412084Z self = 2025-05-07T20:32:03.8414133Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8416325Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1663e0c0>} 2025-05-07T20:32:03.8418326Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8419829Z context = 2025-05-07T20:32:03.8420260Z 2025-05-07T20:32:03.8420499Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8421273Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8421958Z module_map=module_map) 2025-05-07T20:32:03.8422476Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8422975Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:03.8423355Z E ^ 2025-05-07T20:32:03.8424235Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8424908Z 2025-05-07T20:32:03.8425521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:03.8426303Z 2025-05-07T20:32:03.8426477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:03.8427200Z self=, 2025-05-07T20:32:03.8427907Z T=128, 2025-05-07T20:32:03.8428191Z D=7168, 2025-05-07T20:32:03.8428475Z scale_ub=None, 2025-05-07T20:32:03.8428792Z contiguous=False, 2025-05-07T20:32:03.8429116Z compiled=True, 2025-05-07T20:32:03.8429429Z ) 2025-05-07T20:32:03.8429937Z self = 2025-05-07T20:32:03.8430715Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:03.8431394Z 2025-05-07T20:32:03.8431513Z @given( 2025-05-07T20:32:03.8431897Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:03.8432396Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:03.8432885Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:03.8433465Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:03.8433997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:03.8434431Z ) 2025-05-07T20:32:03.8435003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:03.8435813Z def test_silu_mul_quant( 2025-05-07T20:32:03.8436215Z self, 2025-05-07T20:32:03.8436547Z T: int, 2025-05-07T20:32:03.8436886Z D: int, 2025-05-07T20:32:03.8437240Z scale_ub: Optional[float], 2025-05-07T20:32:03.8437714Z contiguous: bool, 2025-05-07T20:32:03.8438120Z compiled: bool, 2025-05-07T20:32:03.8438511Z ) -> None: 2025-05-07T20:32:03.8438874Z torch.manual_seed(2025) 2025-05-07T20:32:03.8439289Z 2025-05-07T20:32:03.8439753Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:03.8440423Z 2025-05-07T20:32:03.8440734Z x_sign = torch.sign(x) 2025-05-07T20:32:03.8441180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:03.8441662Z x = x_sign * x_clamp 2025-05-07T20:32:03.8442027Z x0 = x[:, :D] 2025-05-07T20:32:03.8442385Z x1 = x[:, D:] 2025-05-07T20:32:03.8442724Z 2025-05-07T20:32:03.8443007Z if contiguous: 2025-05-07T20:32:03.8443379Z x0 = x0.contiguous() 2025-05-07T20:32:03.8443785Z x1 = x1.contiguous() 2025-05-07T20:32:03.8444183Z 2025-05-07T20:32:03.8444495Z if scale_ub is not None: 2025-05-07T20:32:03.8444937Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:03.8445506Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:03.8446033Z ) 2025-05-07T20:32:03.8446349Z else: 2025-05-07T20:32:03.8446712Z scale_ub_tensor = None 2025-05-07T20:32:03.8447140Z 2025-05-07T20:32:03.8447525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8448057Z op = silu_mul_quant 2025-05-07T20:32:03.8448478Z if compiled: 2025-05-07T20:32:03.8448881Z op = torch.compile(op) 2025-05-07T20:32:03.8449351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:03.8449800Z 2025-05-07T20:32:03.8450130Z y_fp8, y_scale = fn() 2025-05-07T20:32:03.8450623Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:03.8451094Z 2025-05-07T20:32:03.8451469Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:03.8451963Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:03.8452397Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:03.8452934Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:03.8453720Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8454285Z 2025-05-07T20:32:03.8454630Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:03.8454984Z 2025-05-07T20:32:03.8455160Z moe/activation_test.py:126: 2025-05-07T20:32:03.8455672Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8456272Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:03.8456847Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:03.8458326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:03.8459767Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:03.8460777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:03.8462063Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:03.8463350Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:03.8464614Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:03.8465985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:03.8467193Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:03.8468319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:03.8469298Z fn() 2025-05-07T20:32:03.8470253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:03.8471329Z self.fn.run( 2025-05-07T20:32:03.8472136Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:03.8473113Z kernel = self.compile( 2025-05-07T20:32:03.8474115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:03.8475331Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:03.8476045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:03.8476461Z 2025-05-07T20:32:03.8476836Z self = 2025-05-07T20:32:03.8478875Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:03.8481634Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f1663d940>} 2025-05-07T20:32:03.8484241Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:03.8486209Z context = 2025-05-07T20:32:03.8486794Z 2025-05-07T20:32:03.8487094Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:03.8488056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:03.8488923Z module_map=module_map) 2025-05-07T20:32:03.8489565Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:03.8490197Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:03.8490650Z E ^ 2025-05-07T20:32:03.8491507Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:03.8492380Z 2025-05-07T20:32:03.8493349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0869359Z 2025-05-07T20:32:04.0869897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0870651Z self=, 2025-05-07T20:32:04.0871426Z T=128, 2025-05-07T20:32:04.0871729Z D=7168, 2025-05-07T20:32:04.0872034Z scale_ub=None, 2025-05-07T20:32:04.0872378Z contiguous=False, 2025-05-07T20:32:04.0872745Z compiled=False, 2025-05-07T20:32:04.0873078Z ) 2025-05-07T20:32:04.0873613Z self = 2025-05-07T20:32:04.0874454Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:04.0874913Z 2025-05-07T20:32:04.0875042Z @given( 2025-05-07T20:32:04.0875401Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0876349Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0876865Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0877428Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0877972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0878452Z ) 2025-05-07T20:32:04.0879039Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0879803Z def test_silu_mul_quant( 2025-05-07T20:32:04.0880325Z self, 2025-05-07T20:32:04.0880633Z T: int, 2025-05-07T20:32:04.0880956Z D: int, 2025-05-07T20:32:04.0881322Z scale_ub: Optional[float], 2025-05-07T20:32:04.0881782Z contiguous: bool, 2025-05-07T20:32:04.0882167Z compiled: bool, 2025-05-07T20:32:04.0882536Z ) -> None: 2025-05-07T20:32:04.0882885Z torch.manual_seed(2025) 2025-05-07T20:32:04.0883283Z 2025-05-07T20:32:04.0883747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.0884345Z 2025-05-07T20:32:04.0884654Z x_sign = torch.sign(x) 
2025-05-07T20:32:04.0885144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.0885664Z x = x_sign * x_clamp 2025-05-07T20:32:04.0886051Z x0 = x[:, :D] 2025-05-07T20:32:04.0886401Z x1 = x[:, D:] 2025-05-07T20:32:04.0886739Z 2025-05-07T20:32:04.0887030Z if contiguous: 2025-05-07T20:32:04.0887407Z x0 = x0.contiguous() 2025-05-07T20:32:04.0887832Z x1 = x1.contiguous() 2025-05-07T20:32:04.0888221Z 2025-05-07T20:32:04.0888527Z if scale_ub is not None: 2025-05-07T20:32:04.0888980Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.0889529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.0890046Z ) 2025-05-07T20:32:04.0890350Z else: 2025-05-07T20:32:04.0890681Z scale_ub_tensor = None 2025-05-07T20:32:04.0891096Z 2025-05-07T20:32:04.0891478Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.0892036Z op = silu_mul_quant 2025-05-07T20:32:04.0892442Z if compiled: 2025-05-07T20:32:04.0892851Z op = torch.compile(op) 2025-05-07T20:32:04.0893358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0893815Z 2025-05-07T20:32:04.0894134Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.0894409Z 2025-05-07T20:32:04.0894564Z moe/activation_test.py:117: 2025-05-07T20:32:04.0894988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0895508Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.0895943Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0897055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.0898185Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.0899315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.0900549Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.0901679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.0902552Z kernel = self.compile( 2025-05-07T20:32:04.0903460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.0904583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0905282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0905705Z 2025-05-07T20:32:04.0906061Z self = 2025-05-07T20:32:04.0908003Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.0910557Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f1f143fa700>} 2025-05-07T20:32:04.0912975Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.0915177Z context = 2025-05-07T20:32:04.0915702Z 2025-05-07T20:32:04.0915996Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.0916968Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0917812Z module_map=module_map) 2025-05-07T20:32:04.0918475Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0919093Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0919538Z E ^ 2025-05-07T20:32:04.0920431Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.0921252Z 2025-05-07T20:32:04.0922018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0922968Z 2025-05-07T20:32:04.0923159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0923877Z self=, 2025-05-07T20:32:04.0924581Z T=4096, 2025-05-07T20:32:04.0924877Z D=5120, 2025-05-07T20:32:04.0925189Z scale_ub=1200.0, 2025-05-07T20:32:04.0925566Z contiguous=True, 2025-05-07T20:32:04.0925944Z compiled=False, 2025-05-07T20:32:04.0926292Z ) 2025-05-07T20:32:04.0926833Z self = 2025-05-07T20:32:04.0927710Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:04.0928203Z 2025-05-07T20:32:04.0928335Z @given( 2025-05-07T20:32:04.0928707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0929239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0929785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0930372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0930951Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0931433Z ) 2025-05-07T20:32:04.0932024Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0932808Z def test_silu_mul_quant( 2025-05-07T20:32:04.0933220Z self, 2025-05-07T20:32:04.0933546Z T: int, 2025-05-07T20:32:04.0933880Z D: int, 2025-05-07T20:32:04.0934259Z scale_ub: Optional[float], 2025-05-07T20:32:04.0934897Z contiguous: bool, 2025-05-07T20:32:04.0935297Z compiled: bool, 2025-05-07T20:32:04.0935690Z ) -> None: 2025-05-07T20:32:04.0936062Z torch.manual_seed(2025) 2025-05-07T20:32:04.0936478Z 2025-05-07T20:32:04.0936954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.0937573Z 2025-05-07T20:32:04.0937891Z x_sign = torch.sign(x) 2025-05-07T20:32:04.0938353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.0938878Z x = x_sign * x_clamp 2025-05-07T20:32:04.0950701Z x0 = x[:, :D] 2025-05-07T20:32:04.0951112Z x1 = x[:, D:] 2025-05-07T20:32:04.0951458Z 2025-05-07T20:32:04.0951770Z if contiguous: 2025-05-07T20:32:04.0952153Z x0 = x0.contiguous() 2025-05-07T20:32:04.0952582Z x1 = x1.contiguous() 2025-05-07T20:32:04.0952994Z 2025-05-07T20:32:04.0953318Z if scale_ub is not None: 2025-05-07T20:32:04.0954019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.0954584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.0955108Z ) 2025-05-07T20:32:04.0955440Z else: 2025-05-07T20:32:04.0955805Z scale_ub_tensor = None 2025-05-07T20:32:04.0956221Z 2025-05-07T20:32:04.0956580Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.0957060Z op = silu_mul_quant 2025-05-07T20:32:04.0957423Z if compiled: 
2025-05-07T20:32:04.0957822Z op = torch.compile(op) 2025-05-07T20:32:04.0958360Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0958853Z 2025-05-07T20:32:04.0959182Z > y_fp8, y_scale = fn() 2025-05-07T20:32:04.0959479Z 2025-05-07T20:32:04.0959651Z moe/activation_test.py:117: 2025-05-07T20:32:04.0960256Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0960858Z moe/activation_test.py:115: in fn 2025-05-07T20:32:04.0961355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.0962655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:04.0963969Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:04.0964983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.0966269Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.0967519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.0968438Z kernel = self.compile( 2025-05-07T20:32:04.0969354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.0970503Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.0971251Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.0971679Z 2025-05-07T20:32:04.0972056Z self = 2025-05-07T20:32:04.0973766Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.0976223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143f8220>} 2025-05-07T20:32:04.0978905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.0980860Z context = 2025-05-07T20:32:04.0981396Z 2025-05-07T20:32:04.0981829Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.0982795Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.0983660Z module_map=module_map) 2025-05-07T20:32:04.0984312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.0984933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.0985387Z E ^ 2025-05-07T20:32:04.0986240Z E ValueError("type fp8e4nv not supported in this architecture. 
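For context, the kernel that fails to build here, _fbgemm_silu_mul_quant, fuses exactly the math that the test's ref_fn (printed above) spells out in eager mode: SiLU(x0) * x1 computed in fp32, followed by rowwise FP8 quantization. A standalone sketch of the unfused activation half (silu_mul_ref is an illustrative name, not an FBGEMM API):

    import torch

    # Unfused fp32 reference for the activation inside _fbgemm_silu_mul_quant,
    # mirroring the test's ref_fn: y = SiLU(x0) * x1.
    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32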
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.0987094Z 2025-05-07T20:32:04.0987891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.0988879Z 2025-05-07T20:32:04.0989058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:04.0989944Z self=, 2025-05-07T20:32:04.0990687Z T=1, 2025-05-07T20:32:04.0990994Z D=5120, 2025-05-07T20:32:04.0991326Z scale_ub=None, 2025-05-07T20:32:04.0991699Z contiguous=True, 2025-05-07T20:32:04.0992077Z compiled=True, 2025-05-07T20:32:04.0992431Z ) 2025-05-07T20:32:04.0992998Z self = 2025-05-07T20:32:04.0993871Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:04.0994360Z 2025-05-07T20:32:04.0994488Z @given( 2025-05-07T20:32:04.0994879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:04.0995437Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:04.0995971Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:04.0996567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:04.0997155Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:04.0997669Z ) 2025-05-07T20:32:04.0998311Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:04.0999134Z def test_silu_mul_quant( 2025-05-07T20:32:04.0999555Z self, 2025-05-07T20:32:04.0999887Z T: int, 2025-05-07T20:32:04.1000346Z D: int, 2025-05-07T20:32:04.1000727Z scale_ub: Optional[float], 2025-05-07T20:32:04.1001193Z contiguous: bool, 2025-05-07T20:32:04.1001621Z compiled: bool, 2025-05-07T20:32:04.1002008Z ) -> None: 2025-05-07T20:32:04.1002368Z torch.manual_seed(2025) 2025-05-07T20:32:04.1002786Z 2025-05-07T20:32:04.1003189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:04.1003650Z 2025-05-07T20:32:04.1003919Z x_sign = torch.sign(x) 2025-05-07T20:32:04.1004326Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:04.1004779Z x = x_sign * x_clamp 2025-05-07T20:32:04.1005147Z x0 = x[:, :D] 2025-05-07T20:32:04.1005492Z x1 = x[:, D:] 2025-05-07T20:32:04.1005790Z 2025-05-07T20:32:04.1006066Z if contiguous: 2025-05-07T20:32:04.1006403Z x0 = x0.contiguous() 2025-05-07T20:32:04.1006861Z x1 = x1.contiguous() 2025-05-07T20:32:04.1007233Z 2025-05-07T20:32:04.1007522Z if scale_ub is not None: 2025-05-07T20:32:04.1007955Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:04.1008515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:04.1009037Z ) 2025-05-07T20:32:04.1009337Z else: 2025-05-07T20:32:04.1009652Z scale_ub_tensor = None 2025-05-07T20:32:04.1010058Z 2025-05-07T20:32:04.1010434Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.1010929Z op = silu_mul_quant 2025-05-07T20:32:04.1011340Z if compiled: 2025-05-07T20:32:04.1011749Z op = torch.compile(op) 2025-05-07T20:32:04.1012229Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:04.1012691Z 2025-05-07T20:32:04.1013154Z y_fp8, y_scale = fn() 2025-05-07T20:32:04.1013927Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:04.1014436Z 2025-05-07T20:32:04.1014855Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:04.1015454Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:04.1015982Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:04.1016549Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:04.1017196Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.1017740Z 2025-05-07T20:32:04.1018086Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:04.1018445Z 2025-05-07T20:32:04.1018625Z moe/activation_test.py:126: 2025-05-07T20:32:04.1019147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.1019756Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:04.1020565Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:04.1022086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:04.1023545Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:04.1024573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:04.1025868Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:04.1027216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:04.1028597Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:04.1030000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:04.1031236Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:04.1032377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:04.1033361Z fn() 2025-05-07T20:32:04.1034319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:04.1035423Z self.fn.run( 2025-05-07T20:32:04.1036297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:04.1037297Z kernel = self.compile( 2025-05-07T20:32:04.1038313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:04.1039550Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.1040376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:04.1040807Z 2025-05-07T20:32:04.1041191Z self = 2025-05-07T20:32:04.1043281Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:04.1045972Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143fae80>} 2025-05-07T20:32:04.1048597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:04.1050572Z context = 2025-05-07T20:32:04.1051112Z 2025-05-07T20:32:04.1051415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:04.1053267Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.1054153Z module_map=module_map) 2025-05-07T20:32:04.1054801Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.1055440Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:04.1055906Z E ^ 2025-05-07T20:32:04.1056766Z E ValueError("type fp8e4nv not supported in this architecture. 
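The reference path fails the same way because triton_quantize_fp8_row launches its own Triton kernel, _kernel_quantize_fp8_row, which also materializes fp8e4nv values, so both the fused op and its reference hit the identical architecture limit. The underlying pattern is plain rowwise scaling; a hedged pure-PyTorch sketch (the function name and the exact scale_ub clamping semantics are assumptions based on the test's usage, not fbgemm's implementation):

    from typing import Optional, Tuple

    import torch

    # Rowwise FP8 quantization sketch: scale each row so its max magnitude
    # maps to the fp8 e4m3 max, optionally clamping the row max by scale_ub.
    # Returns the fp8 tensor plus a per-row dequantization scale, so that
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], as the test checks.
    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_fp8 = (y * (fp8_max / row_max)).to(torch.float8_e4m3fn)
        return y_fp8, (row_max / fp8_max).squeeze(-1)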
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:04.1057637Z 2025-05-07T20:32:04.1058449Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:04.3432038Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:04.3434148Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:04.3436993Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:04.3439720Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:04.3441670Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:04.3444196Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:04.3446930Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:04.3449250Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:04.3451780Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:04.3453709Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:04.3455917Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:04.3458048Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:04.3459538Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:04.3461653Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:04.3463948Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:04.3466079Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:04.3468031Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:04.3470276Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:04.3472628Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:04.3474263Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:04.3476316Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:04.3478490Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:04.3479927Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:04.3482159Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:04.3484642Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:04.3486604Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:04.3488340Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:04.3489742Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:04.3491681Z W0507 20:32:04.338000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2168809Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2169969Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.2171395Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2172915Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2173945Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.2175479Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2176941Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2178324Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2179781Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2180901Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.2182238Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2183557Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2184450Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2185722Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2187001Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.2188102Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.2189183Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.2190482Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2191838Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2192788Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2194026Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.2195130Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.2195956Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.2197207Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2198632Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2199842Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2200910Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2201701Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.2202781Z W0507 20:32:05.019000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.2783234Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.2784801Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.2786317Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.2787818Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.2788851Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.2790220Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.2791674Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.2793041Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.2794483Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.2795577Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.2797283Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.2798599Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.2799490Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2800845Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.2802111Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.2803198Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.2804404Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.2805684Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.2807031Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.2807982Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.2809128Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.2810227Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.2811041Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.2812272Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.2813928Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.2815042Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.2816011Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.2816798Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.2817867Z W0507 20:32:05.274000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4650196Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4651443Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.4653250Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4654770Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4655799Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4657209Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4658651Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4660390Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4661832Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4662928Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.4664245Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4665561Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.4666450Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4667763Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4669026Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.4670107Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.4671176Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.4672463Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4673802Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4674748Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4675903Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.4676994Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.4678448Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.4679688Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4681190Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4682301Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4683251Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4684121Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.4685195Z W0507 20:32:05.461000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.4744855Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.4754604Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:05.4756061Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.4757562Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.4758591Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.4759960Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.4761478Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.4762848Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.4764284Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.4765387Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:05.4766715Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.4768019Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:05.4769165Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4770433Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.4771701Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:05.4772787Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.4773858Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return 
visitor(node) 2025-05-07T20:32:05.4775140Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.4776605Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.4777555Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.4778697Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.4779789Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:05.4780593Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.4781836Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.4783250Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.4784362Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.4785318Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.4786097Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:05.4787227Z W0507 20:32:05.470000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
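For context on the repeated warning: identify_mutated_tensors (torch/_higher_order_ops/triton_kernel_wrap.py) lowers a user-defined Triton kernel to TTIR so torch.compile can work out which arguments the kernel writes; when that lowering itself raises, as it does here, it conservatively assumes every input is mutated and compilation continues. The underlying failure is independent of torch.compile; the sketch below is a hypothetical minimal repro (kernel name, sizes, and launch grid are illustrative, not taken from this job) that should trip the same backend check on any CUDA GPU with compute capability below 8.9, such as this runner's A10G (sm_86):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _fp8_cast_kernel(x_ptr, y_ptr, n_elements, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        # On sm_86 the NVIDIA backend rejects this cast at compile time with
        # ValueError("type fp8e4nv not supported in this architecture. ...")
        y = x.to(tl.float8e4nv)
        tl.store(y_ptr + offs, y, mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
    _fp8_cast_kernel[(4,)](x, y, 1024, BLOCK=256)

On sm_89 and newer (Ada/Hopper) the same cast compiles, which is why this failure is specific to the runner's GPU class rather than to the FBGEMM kernels themselves.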
2025-05-07T20:32:05.6628332Z 
2025-05-07T20:32:05.6628630Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:05.6629118Z     self=,
2025-05-07T20:32:05.6629647Z     T=2048,
2025-05-07T20:32:05.6629931Z     D=5120,
2025-05-07T20:32:05.6630135Z     scale_ub=None,
2025-05-07T20:32:05.6630371Z     contiguous=True,
2025-05-07T20:32:05.6630608Z     compiled=True,
2025-05-07T20:32:05.6630820Z )
2025-05-07T20:32:05.6631159Z self = 
2025-05-07T20:32:05.6631678Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:05.6631958Z 
2025-05-07T20:32:05.6632050Z     @given(
2025-05-07T20:32:05.6632289Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:05.6632642Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:05.6633302Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:05.6633649Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:05.6633998Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:05.6634299Z     )
2025-05-07T20:32:05.6634663Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:05.6635127Z     def test_silu_mul_quant(
2025-05-07T20:32:05.6635385Z         self,
2025-05-07T20:32:05.6635589Z         T: int,
2025-05-07T20:32:05.6635796Z         D: int,
2025-05-07T20:32:05.6636030Z         scale_ub: Optional[float],
2025-05-07T20:32:05.6636311Z         contiguous: bool,
2025-05-07T20:32:05.6636568Z         compiled: bool,
2025-05-07T20:32:05.6636814Z     ) -> None:
2025-05-07T20:32:05.6637039Z         torch.manual_seed(2025)
2025-05-07T20:32:05.6637299Z 
2025-05-07T20:32:05.6637589Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:05.6638098Z 
2025-05-07T20:32:05.6638304Z         x_sign = torch.sign(x)
2025-05-07T20:32:05.6638615Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:05.6638946Z         x = x_sign * x_clamp
2025-05-07T20:32:05.6639200Z         x0 = x[:, :D]
2025-05-07T20:32:05.6639431Z         x1 = x[:, D:]
2025-05-07T20:32:05.6639656Z 
2025-05-07T20:32:05.6639848Z         if contiguous:
2025-05-07T20:32:05.6640095Z             x0 = x0.contiguous()
2025-05-07T20:32:05.6640453Z             x1 = x1.contiguous()
2025-05-07T20:32:05.6640700Z 
2025-05-07T20:32:05.6640909Z         if scale_ub is not None:
2025-05-07T20:32:05.6641206Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:05.6641559Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:05.6641889Z             )
2025-05-07T20:32:05.6642095Z         else:
2025-05-07T20:32:05.6642319Z             scale_ub_tensor = None
2025-05-07T20:32:05.6642590Z 
2025-05-07T20:32:05.6642846Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.6643188Z             op = silu_mul_quant
2025-05-07T20:32:05.6643449Z             if compiled:
2025-05-07T20:32:05.6643714Z                 op = torch.compile(op)
2025-05-07T20:32:05.6644027Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:05.6644317Z 
2025-05-07T20:32:05.6644523Z         y_fp8, y_scale = fn()
2025-05-07T20:32:05.6644825Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:05.6645127Z 
2025-05-07T20:32:05.6645380Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:05.6645735Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:05.6646039Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:05.6646372Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:05.6646755Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.6647134Z 
2025-05-07T20:32:05.6647347Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:05.6647559Z 
2025-05-07T20:32:05.6647671Z moe/activation_test.py:126: 
2025-05-07T20:32:05.6647988Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.6648343Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:05.6648693Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:05.6649523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:05.6650304Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:05.6650889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:05.6651609Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:05.6652334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:05.6653189Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:05.6653960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:05.6654634Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:05.6655270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:05.6655808Z     fn()
2025-05-07T20:32:05.6656343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:05.6656959Z     self.fn.run(
2025-05-07T20:32:05.6657443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:05.6658002Z     kernel = self.compile(
2025-05-07T20:32:05.6658573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:05.6659350Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:05.6659764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:05.6660013Z 
2025-05-07T20:32:05.6660233Z self = 
2025-05-07T20:32:05.6661364Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:05.6662816Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f0b80>}
2025-05-07T20:32:05.6664208Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:05.6665283Z context = 
2025-05-07T20:32:05.6665593Z 
2025-05-07T20:32:05.6665769Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:05.6666321Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:05.6666810Z                            module_map=module_map)
2025-05-07T20:32:05.6667199Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:05.6667579Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:32:05.6667863Z E   ^
2025-05-07T20:32:05.6668344Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:05.6668819Z 
2025-05-07T20:32:05.6669254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:05.6669792Z 
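The dtype at fault is Triton's fp8e4nv, the encoding behind torch.float8_e4m3fn. Triton's NVIDIA backend only implements it for compute capability 8.9 and newer (Ada/Hopper), while the A10G on this g5.4xlarge runner reports 8.6 and therefore only gets fp8e4b15 and fp8e5, exactly as the ValueError states; both the torch.compile path and the eager triton_quantize_fp8_row reference die on the same check. The usual remedy is to gate such tests on hardware support; a minimal sketch (hypothetical helper, not part of moe/activation_test.py):

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs an sm_89+ GPU; on sm_86 parts
        # like the A10G, Triton only exposes fp8e4b15 and fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the test above:
    # @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(...): ...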
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6668819Z 2025-05-07T20:32:05.6669254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.6669792Z 2025-05-07T20:32:05.6669916Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:05.6670347Z self=, 2025-05-07T20:32:05.6670772Z T=128, 2025-05-07T20:32:05.6670971Z D=5120, 2025-05-07T20:32:05.6671176Z scale_ub=None, 2025-05-07T20:32:05.6671395Z contiguous=True, 2025-05-07T20:32:05.6671631Z compiled=True, 2025-05-07T20:32:05.6671846Z ) 2025-05-07T20:32:05.6672177Z self = 2025-05-07T20:32:05.6672692Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:05.6672968Z 2025-05-07T20:32:05.6673055Z @given( 2025-05-07T20:32:05.6673292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:05.6673626Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:05.6673962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:05.6674311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:05.6674745Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:05.6675057Z ) 2025-05-07T20:32:05.6675426Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:05.6675887Z def test_silu_mul_quant( 2025-05-07T20:32:05.6676146Z self, 2025-05-07T20:32:05.6676354Z T: int, 2025-05-07T20:32:05.6676558Z D: int, 2025-05-07T20:32:05.6676791Z scale_ub: Optional[float], 2025-05-07T20:32:05.6677080Z contiguous: bool, 2025-05-07T20:32:05.6677329Z compiled: bool, 2025-05-07T20:32:05.6677567Z ) -> None: 2025-05-07T20:32:05.6677796Z torch.manual_seed(2025) 2025-05-07T20:32:05.6678048Z 2025-05-07T20:32:05.6678343Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:05.6678703Z 2025-05-07T20:32:05.6678906Z x_sign = torch.sign(x) 2025-05-07T20:32:05.6679211Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:05.6679647Z x = x_sign * x_clamp 2025-05-07T20:32:05.6679908Z x0 = x[:, :D] 2025-05-07T20:32:05.6680259Z x1 = x[:, D:] 2025-05-07T20:32:05.6680484Z 2025-05-07T20:32:05.6680676Z if contiguous: 2025-05-07T20:32:05.6680921Z x0 = x0.contiguous() 2025-05-07T20:32:05.6681192Z x1 = x1.contiguous() 2025-05-07T20:32:05.6681439Z 2025-05-07T20:32:05.6681644Z if scale_ub is not None: 2025-05-07T20:32:05.6681930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:05.6682278Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:05.6682608Z ) 2025-05-07T20:32:05.6682818Z else: 2025-05-07T20:32:05.6683040Z scale_ub_tensor = None 2025-05-07T20:32:05.6683304Z 2025-05-07T20:32:05.6683548Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6683874Z op = silu_mul_quant 2025-05-07T20:32:05.6684147Z if compiled: 2025-05-07T20:32:05.6684411Z op = torch.compile(op) 2025-05-07T20:32:05.6684722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:05.6685017Z 2025-05-07T20:32:05.6685219Z y_fp8, y_scale = fn() 2025-05-07T20:32:05.6685512Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:05.6685822Z 2025-05-07T20:32:05.6686074Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:05.6686428Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:05.6686735Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:05.6687112Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:05.6687491Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6687812Z 2025-05-07T20:32:05.6688024Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:05.6688228Z 2025-05-07T20:32:05.6688338Z moe/activation_test.py:126: 2025-05-07T20:32:05.6688646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6689007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:05.6689352Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:05.6690176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:05.6690954Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:05.6691529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:05.6692245Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:05.6692974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:05.6693723Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:05.6694581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:05.6695259Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:05.6695886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:05.6696432Z fn() 2025-05-07T20:32:05.6696967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:05.6697574Z self.fn.run( 2025-05-07T20:32:05.6698062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:05.6698621Z kernel = self.compile( 2025-05-07T20:32:05.6699192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:05.6699875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.6700384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:05.6700633Z 2025-05-07T20:32:05.6700850Z self = 2025-05-07T20:32:05.6701980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:05.6703412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f1da0>} 2025-05-07T20:32:05.6704806Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:05.6705877Z context = 2025-05-07T20:32:05.6706186Z 2025-05-07T20:32:05.6706375Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:05.6706960Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.6707470Z module_map=module_map) 2025-05-07T20:32:05.6707856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.6708235Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:05.6708512Z E ^ 2025-05-07T20:32:05.6709000Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.6709469Z 2025-05-07T20:32:05.6709910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:05.8984291Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.8985466Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.8986875Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.8988385Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.8989419Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.8991104Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.8992567Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.8993942Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.8995389Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.8996493Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:05.8997957Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.8999258Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.9000224Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9001492Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9002754Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.9003845Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.9004923Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:05.9006207Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9007556Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9008514Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9009662Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.9010755Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.9011578Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.9012812Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9014573Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9015816Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9016781Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9017567Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.9018643Z W0507 20:32:05.894000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:05.9597343Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:05.9598471Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:05.9600299Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:05.9601809Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:05.9602841Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:05.9604210Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:05.9605660Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:05.9607033Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:05.9608472Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:05.9609572Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:05.9610897Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:05.9612196Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:05.9613088Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9614589Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:05.9615858Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:05.9616944Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:05.9618187Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return 
visitor(node) 2025-05-07T20:32:05.9619476Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:05.9620815Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:05.9621770Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:05.9622906Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:05.9624121Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:05.9624942Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:05.9626176Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:05.9627593Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:05.9628696Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:05.9629656Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:05.9630445Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:05.9631516Z W0507 20:32:05.956000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:06.1473029Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:06.1474156Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:06.1475590Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:06.1477113Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:06.1478153Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:06.1479536Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:06.1481060Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:06.1482765Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:06.1484224Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:06.1485327Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:06.1486657Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:06.1487962Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:06.1488989Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:06.1490254Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:06.1491527Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:06.1492617Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:06.1493684Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return 
visitor(node) 2025-05-07T20:32:06.1494973Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:06.1496312Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:06.1497263Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:06.1498405Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:06.1499490Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:06.1500315Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:06.1501545Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:06.1502962Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:06.1504076Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:06.1505032Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:06.1505818Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:06.1506979Z W0507 20:32:06.143000 227432 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0904322Z
2025-05-07T20:32:07.0904652Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.0905272Z self=,
2025-05-07T20:32:07.0905862Z T=4096,
2025-05-07T20:32:07.0906127Z D=5120,
2025-05-07T20:32:07.0906393Z scale_ub=None,
2025-05-07T20:32:07.0906680Z contiguous=True,
2025-05-07T20:32:07.0906950Z compiled=True,
2025-05-07T20:32:07.0907187Z )
2025-05-07T20:32:07.0907580Z self =
2025-05-07T20:32:07.0908100Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:07.0908408Z
2025-05-07T20:32:07.0908493Z @given(
2025-05-07T20:32:07.0908758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:07.0909094Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:07.0909427Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:07.0909789Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:07.0910137Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:07.0910448Z )
2025-05-07T20:32:07.0910828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:07.0911298Z def test_silu_mul_quant(
2025-05-07T20:32:07.0911557Z self,
2025-05-07T20:32:07.0911775Z T: int,
2025-05-07T20:32:07.0911993Z D: int,
2025-05-07T20:32:07.0912223Z scale_ub: Optional[float],
2025-05-07T20:32:07.0912524Z contiguous: bool,
2025-05-07T20:32:07.0912788Z compiled: bool,
2025-05-07T20:32:07.0913028Z ) -> None:
2025-05-07T20:32:07.0913274Z torch.manual_seed(2025)
2025-05-07T20:32:07.0914825Z
2025-05-07T20:32:07.0915190Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:07.0915571Z
2025-05-07T20:32:07.0915785Z x_sign = torch.sign(x)
2025-05-07T20:32:07.0916093Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:07.0916431Z x = x_sign * x_clamp
2025-05-07T20:32:07.0916698Z x0 = x[:, :D]
2025-05-07T20:32:07.0916919Z x1 = x[:, D:]
2025-05-07T20:32:07.0917137Z
2025-05-07T20:32:07.0917332Z if contiguous:
2025-05-07T20:32:07.0917567Z x0 = x0.contiguous()
2025-05-07T20:32:07.0917841Z x1 = x1.contiguous()
2025-05-07T20:32:07.0918092Z
2025-05-07T20:32:07.0918295Z if scale_ub is not None:
2025-05-07T20:32:07.0918588Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:07.0918942Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:07.0919265Z )
2025-05-07T20:32:07.0919482Z else:
2025-05-07T20:32:07.0920051Z scale_ub_tensor = None
2025-05-07T20:32:07.0920440Z
2025-05-07T20:32:07.0920692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.0921016Z op = silu_mul_quant
2025-05-07T20:32:07.0921282Z if compiled:
2025-05-07T20:32:07.0921543Z op = torch.compile(op)
2025-05-07T20:32:07.0921855Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.0922136Z
2025-05-07T20:32:07.0922342Z y_fp8, y_scale = fn()
2025-05-07T20:32:07.0922640Z y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:07.0922936Z
2025-05-07T20:32:07.0923184Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:07.0923532Z x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:07.0923831Z x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:07.0924155Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:07.0924754Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0925079Z
2025-05-07T20:32:07.0925285Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.0925494Z
2025-05-07T20:32:07.0925602Z moe/activation_test.py:126:
2025-05-07T20:32:07.0925908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0926257Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.0926598Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0927420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.0928198Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.0928768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.0929483Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.0930215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.0930963Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.0931724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.0932394Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.0933022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.0933554Z fn()
2025-05-07T20:32:07.0934081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.0934682Z self.fn.run(
2025-05-07T20:32:07.0935162Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.0935721Z kernel = self.compile(
2025-05-07T20:32:07.0936286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.0936965Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.0937396Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0937661Z
2025-05-07T20:32:07.0937874Z self =
2025-05-07T20:32:07.0938995Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.0940437Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef32bf1a0>}
2025-05-07T20:32:07.0941915Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.0942982Z context =
2025-05-07T20:32:07.0943291Z
2025-05-07T20:32:07.0943465Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.0944013Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.0944493Z module_map=module_map)
2025-05-07T20:32:07.0944875Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.0945249Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.0945518Z E ^
2025-05-07T20:32:07.0946001Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0946554Z
2025-05-07T20:32:07.0946994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.0947526Z
2025-05-07T20:32:07.0947643Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.0948073Z self=,
2025-05-07T20:32:07.0948492Z T=16384,
2025-05-07T20:32:07.0948695Z D=5120,
2025-05-07T20:32:07.0948890Z scale_ub=None,
2025-05-07T20:32:07.0949112Z contiguous=True,
2025-05-07T20:32:07.0949343Z compiled=True,
2025-05-07T20:32:07.0949561Z )
2025-05-07T20:32:07.0949891Z self =
2025-05-07T20:32:07.0950409Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:07.0965240Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.0965530Z
2025-05-07T20:32:07.0965640Z moe/activation_test.py:126:
2025-05-07T20:32:07.0965947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0966289Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.0966628Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.0967477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.0968279Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.0968840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.0969547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.0970259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.0971023Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.0971774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.0972439Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.0973063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.0973592Z fn()
2025-05-07T20:32:07.0974113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.0974713Z self.fn.run(
2025-05-07T20:32:07.0975198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.0975740Z kernel = self.compile(
2025-05-07T20:32:07.0976298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.0976986Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.0977394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.0977640Z
2025-05-07T20:32:07.0977856Z self =
2025-05-07T20:32:07.0978969Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.0980387Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef3a90860>}
2025-05-07T20:32:07.0981782Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.0982919Z context =
2025-05-07T20:32:07.0983226Z
2025-05-07T20:32:07.0983397Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.0983938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.0984420Z module_map=module_map)
2025-05-07T20:32:07.0984792Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.0985162Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.0985436Z E ^
2025-05-07T20:32:07.0985912Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.0986393Z
2025-05-07T20:32:07.0986824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.1201445Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:07.1202981Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:07.1204368Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:07.1205393Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:07.1206545Z W0507 20:32:07.119000 227432 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
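The recompile-limit warning above explains why the compiled path stops being exercised from here on: after eight recompilations of silu_mul_quant (each triggered by a stride change between contiguous and sliced inputs), torch._dynamo falls back to eager execution for subsequent calls. A minimal sketch of how a caller could respond, assuming only the config attribute and APIs named in the warning; the import path for silu_mul_quant is inferred from the traceback and is an assumption:

    import torch
    import torch._dynamo
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the per-function limit that the warning reports (default 8), so
    # each stride variant of the inputs can keep its own compiled graph...
    torch._dynamo.config.recompile_limit = 32

    # ...or compile with dynamic shapes up front, so a single graph is less
    # likely to be re-specialized for every size/stride combination.
    op = torch.compile(silu_mul_quant, dynamic=True)

Running with TORCH_LOGS="recompiles" in the environment, as the warning suggests, prints the exact guard that failed before each recompilation.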
2025-05-07T20:32:07.3332164Z
2025-05-07T20:32:07.3332427Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3333046Z self=,
2025-05-07T20:32:07.3333512Z T=1,
2025-05-07T20:32:07.3333708Z D=5120,
2025-05-07T20:32:07.3333960Z scale_ub=1200.0,
2025-05-07T20:32:07.3334198Z contiguous=True,
2025-05-07T20:32:07.3334436Z compiled=True,
2025-05-07T20:32:07.3334655Z )
2025-05-07T20:32:07.3334995Z self =
2025-05-07T20:32:07.3335512Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:07.3356484Z > y_fp8, y_scale = fn()
2025-05-07T20:32:07.3356664Z
2025-05-07T20:32:07.3356782Z moe/activation_test.py:117:
2025-05-07T20:32:07.3357096Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3357456Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.3357761Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.3358355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:07.3358948Z return fn(*args, **kwargs)
2025-05-07T20:32:07.3359648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.3360468Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.3361047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3361770Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3362475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3363041Z kernel = self.compile(
2025-05-07T20:32:07.3363616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3364316Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3364751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3364995Z
2025-05-07T20:32:07.3365216Z self =
2025-05-07T20:32:07.3366358Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3367808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d12ca0>}
2025-05-07T20:32:07.3369212Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3370289Z context =
2025-05-07T20:32:07.3370596Z
2025-05-07T20:32:07.3370775Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3371339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3371837Z module_map=module_map)
2025-05-07T20:32:07.3372400Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3372776Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.3373053Z E ^
2025-05-07T20:32:07.3373550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3374021Z
2025-05-07T20:32:07.3374458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.3375000Z
2025-05-07T20:32:07.3375112Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.3375554Z self=,
2025-05-07T20:32:07.3375978Z T=1,
2025-05-07T20:32:07.3376170Z D=5120,
2025-05-07T20:32:07.3376383Z scale_ub=None,
2025-05-07T20:32:07.3376612Z contiguous=False,
2025-05-07T20:32:07.3376849Z compiled=True,
2025-05-07T20:32:07.3377147Z )
2025-05-07T20:32:07.3377493Z self =
2025-05-07T20:32:07.3378050Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:07.3393291Z > y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:07.3393500Z
2025-05-07T20:32:07.3393607Z moe/activation_test.py:126:
2025-05-07T20:32:07.3393930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3394292Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:07.3394644Z return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:07.3395466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:07.3396252Z _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:07.3396920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.3397646Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.3398367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:07.3399136Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:07.3399919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:07.3400673Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:07.3401317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:07.3401869Z fn()
2025-05-07T20:32:07.3402410Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:07.3403025Z self.fn.run(
2025-05-07T20:32:07.3403530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.3404095Z kernel = self.compile(
2025-05-07T20:32:07.3404664Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.3405357Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.3405791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.3406037Z
2025-05-07T20:32:07.3406270Z self =
2025-05-07T20:32:07.3407399Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.3408893Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef38c60c0>}
2025-05-07T20:32:07.3410289Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.3411355Z context =
2025-05-07T20:32:07.3411657Z
2025-05-07T20:32:07.3411845Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.3412394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.3412898Z module_map=module_map)
2025-05-07T20:32:07.3413289Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.3413932Z E def _kernel_quantize_fp8_row(
2025-05-07T20:32:07.3414230Z E ^
2025-05-07T20:32:07.3414864Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:07.3415346Z
2025-05-07T20:32:07.3415793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:07.4794648Z
2025-05-07T20:32:07.4794914Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:07.4795403Z self=,
2025-05-07T20:32:07.4795868Z T=1,
2025-05-07T20:32:07.4796071Z D=5120,
2025-05-07T20:32:07.4796280Z scale_ub=None,
2025-05-07T20:32:07.4796500Z contiguous=True,
2025-05-07T20:32:07.4796739Z compiled=False,
2025-05-07T20:32:07.4796959Z )
2025-05-07T20:32:07.4797291Z self =
2025-05-07T20:32:07.4797803Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:07.4810085Z > y_fp8, y_scale = fn()
2025-05-07T20:32:07.4810255Z
2025-05-07T20:32:07.4810363Z moe/activation_test.py:117:
2025-05-07T20:32:07.4810665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.4811011Z moe/activation_test.py:115: in fn
2025-05-07T20:32:07.4811306Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:07.4812148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:07.4812877Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:07.4813586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:07.4814303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:07.4814989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:07.4815542Z kernel = self.compile(
2025-05-07T20:32:07.4816105Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:07.4816794Z module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:07.4817202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:07.4817445Z
2025-05-07T20:32:07.4817834Z self =
2025-05-07T20:32:07.4818993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:07.4820425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164f1bc0>}
2025-05-07T20:32:07.4821818Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:07.4822880Z context =
2025-05-07T20:32:07.4823187Z
2025-05-07T20:32:07.4823361Z def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:07.4823920Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:07.4824402Z module_map=module_map)
2025-05-07T20:32:07.4824785Z E triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:07.4825155Z E def _fbgemm_silu_mul_quant(
2025-05-07T20:32:07.4825422Z E ^
2025-05-07T20:32:07.4825908Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis then tried further example draws. Each repeated the identical source listing and traceback above, failing at moe/activation_test.py:117 in _fbgemm_silu_mul_quant with the same CompilationError; draws with compiled=True additionally pass through torch/_dynamo/eval_frame.py:678 before reaching activation.py:80. The duplicated listings are elided:]

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

In this draw the test body above ran past `y_fp8, y_scale = fn()`, and the failure surfaced in the reference path instead:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
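Every failure above shares one root cause: Triton's fp8e4nv is FP8 E4M3 with NVIDIA semantics, and as of recent Triton releases it is only lowered on GPUs with native FP8 support (compute capability 8.9 or newer, i.e. Ada/Hopper class). The error text shows that the GPU used for this job offers only the 'fp8e4b15' and 'fp8e5' formats, so every kernel touching an E4M3 tensor aborts inside ast_to_ttir, before any launch. A minimal guard of the kind a test suite could use to skip rather than fail on such hardware is sketched below; the helper name and its placement are illustrative, not part of activation_test.py:

    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9); older GPUs hit
        # the "type fp8e4nv not supported in this architecture" CompilationError.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to the test class or method, this skips cleanly on older GPUs:
    # @unittest.skipUnless(supports_fp8e4nv(), "requires FP8 E4M3 capable hardware")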
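For context on what the failing kernels compute: silu_mul_quant fuses y = silu(x0) * x1 with row-wise FP8 quantization, and the test's reference path does the same in two steps, as its dequantization line `y_fp8.to(torch.float32) * y_scale[:, None]` shows. A rough pure-PyTorch sketch of the row-wise quantization idea follows, with illustrative names and the standard E4M3 maximum; this is a sketch of the concept, not FBGEMM's implementation, and the exact scale_ub handling is assumed:

    import torch

    FP8_E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def quantize_fp8_row_sketch(y, scale_ub=None):
        # One dequantization scale per row, derived from the row's max magnitude,
        # optionally clamped by scale_ub (cf. the test's scale_ub_tensor).
        row_max = y.abs().amax(dim=-1).float().clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / FP8_E4M3_MAX
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale  # y_fp8.to(torch.float32) * y_scale[:, None] ~ y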
[The remaining draws in this batch repeated the original source listing and traceback once more, each failing in _fbgemm_silu_mul_quant with the identical CompilationError; the duplicated listings are elided:]

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5731399Z 2025-05-07T20:32:08.5731848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5732407Z 2025-05-07T20:32:08.5732520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5732970Z self=, 2025-05-07T20:32:08.5733398Z T=1, 2025-05-07T20:32:08.5733604Z D=5120, 2025-05-07T20:32:08.5733818Z scale_ub=None, 2025-05-07T20:32:08.5734051Z contiguous=False, 2025-05-07T20:32:08.5734304Z compiled=False, 2025-05-07T20:32:08.5734538Z ) 2025-05-07T20:32:08.5734877Z self = 2025-05-07T20:32:08.5735408Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:08.5735699Z 2025-05-07T20:32:08.5735785Z @given( 2025-05-07T20:32:08.5736043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5736378Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5736724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5737214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5737574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5737917Z ) 2025-05-07T20:32:08.5738324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5738795Z def test_silu_mul_quant( 2025-05-07T20:32:08.5739069Z self, 2025-05-07T20:32:08.5739289Z T: int, 2025-05-07T20:32:08.5739502Z D: int, 2025-05-07T20:32:08.5739748Z scale_ub: Optional[float], 2025-05-07T20:32:08.5740051Z contiguous: bool, 2025-05-07T20:32:08.5740310Z compiled: bool, 2025-05-07T20:32:08.5740565Z ) -> None: 2025-05-07T20:32:08.5740806Z torch.manual_seed(2025) 2025-05-07T20:32:08.5741076Z 2025-05-07T20:32:08.5741370Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5741744Z 2025-05-07T20:32:08.5742049Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5742371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5742715Z x = x_sign * x_clamp 2025-05-07T20:32:08.5742987Z x0 = x[:, :D] 2025-05-07T20:32:08.5743227Z x1 = x[:, D:] 2025-05-07T20:32:08.5743459Z 2025-05-07T20:32:08.5743669Z if contiguous: 2025-05-07T20:32:08.5743917Z x0 = x0.contiguous() 2025-05-07T20:32:08.5744209Z x1 = x1.contiguous() 2025-05-07T20:32:08.5744480Z 2025-05-07T20:32:08.5744687Z if scale_ub is not None: 2025-05-07T20:32:08.5744990Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5745362Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5745695Z ) 2025-05-07T20:32:08.5745917Z else: 2025-05-07T20:32:08.5746153Z scale_ub_tensor = None 2025-05-07T20:32:08.5746422Z 2025-05-07T20:32:08.5746683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5747038Z op = silu_mul_quant 2025-05-07T20:32:08.5747321Z if compiled: 2025-05-07T20:32:08.5747589Z op = torch.compile(op) 2025-05-07T20:32:08.5747915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5748216Z 2025-05-07T20:32:08.5748422Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5748607Z 2025-05-07T20:32:08.5748713Z moe/activation_test.py:117: 2025-05-07T20:32:08.5749036Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5749389Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5749699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5750434Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.5751165Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5751739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5752481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5753195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5753757Z kernel = self.compile( 2025-05-07T20:32:08.5754338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5755040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5755473Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5755717Z 2025-05-07T20:32:08.5755938Z self = 2025-05-07T20:32:08.5757076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5758637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d125c0>} 2025-05-07T20:32:08.5760052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5761193Z context = 2025-05-07T20:32:08.5761499Z 2025-05-07T20:32:08.5761679Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5762240Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5762741Z module_map=module_map) 2025-05-07T20:32:08.5763130Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5763597Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5763889Z E ^ 2025-05-07T20:32:08.5764390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5764862Z 2025-05-07T20:32:08.5765299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.5765848Z 2025-05-07T20:32:08.5765962Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.5766441Z self=, 2025-05-07T20:32:08.5766872Z T=4096, 2025-05-07T20:32:08.5767071Z D=7168, 2025-05-07T20:32:08.5767287Z scale_ub=1200.0, 2025-05-07T20:32:08.5767533Z contiguous=False, 2025-05-07T20:32:08.5767775Z compiled=False, 2025-05-07T20:32:08.5768005Z ) 2025-05-07T20:32:08.5768358Z self = 2025-05-07T20:32:08.5768910Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.5769202Z 2025-05-07T20:32:08.5769287Z @given( 2025-05-07T20:32:08.5769543Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.5769885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.5770213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.5770576Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.5770937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.5771240Z ) 2025-05-07T20:32:08.5771619Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.5772095Z def test_silu_mul_quant( 2025-05-07T20:32:08.5772363Z self, 2025-05-07T20:32:08.5772571Z T: int, 2025-05-07T20:32:08.5772791Z D: int, 2025-05-07T20:32:08.5773033Z scale_ub: Optional[float], 2025-05-07T20:32:08.5773318Z contiguous: bool, 2025-05-07T20:32:08.5773586Z compiled: bool, 2025-05-07T20:32:08.5773829Z ) -> None: 2025-05-07T20:32:08.5774060Z torch.manual_seed(2025) 2025-05-07T20:32:08.5774327Z 2025-05-07T20:32:08.5774625Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.5774985Z 2025-05-07T20:32:08.5775205Z x_sign = torch.sign(x) 2025-05-07T20:32:08.5775521Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.5775850Z x = x_sign * x_clamp 2025-05-07T20:32:08.5776114Z x0 = x[:, :D] 2025-05-07T20:32:08.5776353Z x1 = x[:, D:] 2025-05-07T20:32:08.5776571Z 2025-05-07T20:32:08.5776779Z if contiguous: 2025-05-07T20:32:08.5777035Z x0 = x0.contiguous() 2025-05-07T20:32:08.5777308Z x1 = x1.contiguous() 2025-05-07T20:32:08.5777576Z 2025-05-07T20:32:08.5777814Z if scale_ub is not None: 2025-05-07T20:32:08.5778131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.5778494Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.5778911Z ) 2025-05-07T20:32:08.5779125Z else: 2025-05-07T20:32:08.5779352Z scale_ub_tensor = None 2025-05-07T20:32:08.5779628Z 2025-05-07T20:32:08.5779883Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.5780216Z op = silu_mul_quant 2025-05-07T20:32:08.5780489Z if compiled: 2025-05-07T20:32:08.5780760Z op = torch.compile(op) 2025-05-07T20:32:08.5781074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5781377Z 2025-05-07T20:32:08.5781595Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.5781770Z 2025-05-07T20:32:08.5781878Z moe/activation_test.py:117: 2025-05-07T20:32:08.5782199Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5782557Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.5782864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.5783666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.5784394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.5784955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.5785673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.5786378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.5786943Z kernel = self.compile( 2025-05-07T20:32:08.5787504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.5788196Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.5788618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.5788863Z 2025-05-07T20:32:08.5789092Z self = 2025-05-07T20:32:08.5790209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.5791636Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f60900>} 2025-05-07T20:32:08.5793039Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.5794110Z context = 2025-05-07T20:32:08.5794413Z 2025-05-07T20:32:08.5794595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.5795150Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.5795645Z module_map=module_map) 2025-05-07T20:32:08.5796035Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.5796407Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.5796686Z E ^ 2025-05-07T20:32:08.5797179Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.5797647Z 2025-05-07T20:32:08.5798130Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.7331801Z 2025-05-07T20:32:08.7332133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.7332792Z self=, 2025-05-07T20:32:08.7333435Z T=16384, 2025-05-07T20:32:08.7333722Z D=7168, 2025-05-07T20:32:08.7334403Z scale_ub=None, 2025-05-07T20:32:08.7334720Z contiguous=True, 2025-05-07T20:32:08.7335052Z compiled=True, 2025-05-07T20:32:08.7335357Z ) 2025-05-07T20:32:08.7335816Z self = 2025-05-07T20:32:08.7336527Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:08.7336823Z 2025-05-07T20:32:08.7336919Z @given( 2025-05-07T20:32:08.7337165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.7337507Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.7337842Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.7338194Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.7338553Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.7338863Z ) 2025-05-07T20:32:08.7339239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.7340356Z def test_silu_mul_quant( 2025-05-07T20:32:08.7340627Z self, 2025-05-07T20:32:08.7340841Z T: int, 2025-05-07T20:32:08.7341053Z D: int, 2025-05-07T20:32:08.7341291Z scale_ub: Optional[float], 2025-05-07T20:32:08.7341587Z contiguous: bool, 2025-05-07T20:32:08.7341842Z compiled: bool, 2025-05-07T20:32:08.7342092Z ) -> None: 2025-05-07T20:32:08.7342333Z torch.manual_seed(2025) 2025-05-07T20:32:08.7342592Z 2025-05-07T20:32:08.7342891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.7343264Z 2025-05-07T20:32:08.7343471Z x_sign = torch.sign(x) 2025-05-07T20:32:08.7343790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.7344126Z x = x_sign * x_clamp 2025-05-07T20:32:08.7344382Z x0 = x[:, :D] 2025-05-07T20:32:08.7344622Z x1 = x[:, D:] 2025-05-07T20:32:08.7344854Z 2025-05-07T20:32:08.7345063Z if contiguous: 2025-05-07T20:32:08.7345324Z x0 = x0.contiguous() 2025-05-07T20:32:08.7345612Z x1 = x1.contiguous() 2025-05-07T20:32:08.7345877Z 2025-05-07T20:32:08.7346085Z if scale_ub is not None: 2025-05-07T20:32:08.7346384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.7346751Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.7347084Z ) 2025-05-07T20:32:08.7347303Z else: 2025-05-07T20:32:08.7347538Z scale_ub_tensor = None 2025-05-07T20:32:08.7347831Z 2025-05-07T20:32:08.7348109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.7348451Z op = silu_mul_quant 2025-05-07T20:32:08.7348717Z if compiled: 2025-05-07T20:32:08.7348992Z op = torch.compile(op) 2025-05-07T20:32:08.7349320Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7349613Z 2025-05-07T20:32:08.7349829Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.7350022Z 2025-05-07T20:32:08.7350158Z moe/activation_test.py:117: 2025-05-07T20:32:08.7350484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7350847Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.7351145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7351745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.7352338Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.7353029Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.7353757Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.7354335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.7355059Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.7355856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.7356428Z kernel = self.compile( 2025-05-07T20:32:08.7357008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.7357700Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7358133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7358385Z 2025-05-07T20:32:08.7358606Z self = 2025-05-07T20:32:08.7359746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.7361317Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f61c60>} 2025-05-07T20:32:08.7362795Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.7363870Z context = 2025-05-07T20:32:08.7364184Z 2025-05-07T20:32:08.7364364Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.7364926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7365421Z module_map=module_map) 2025-05-07T20:32:08.7365815Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7366201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7366478Z E ^ 2025-05-07T20:32:08.7366987Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.7367465Z 2025-05-07T20:32:08.7367928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.7368494Z 2025-05-07T20:32:08.7368615Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.7369055Z self=, 2025-05-07T20:32:08.7369490Z T=4096, 2025-05-07T20:32:08.7369699Z D=5120, 2025-05-07T20:32:08.7369903Z scale_ub=None, 2025-05-07T20:32:08.7370141Z contiguous=False, 2025-05-07T20:32:08.7370388Z compiled=True, 2025-05-07T20:32:08.7370603Z ) 2025-05-07T20:32:08.7370948Z self = 2025-05-07T20:32:08.7371479Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:08.7371767Z 2025-05-07T20:32:08.7371866Z @given( 2025-05-07T20:32:08.7372115Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.7372458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.7372790Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.7373141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.7373500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.7373813Z ) 2025-05-07T20:32:08.7374186Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.7374659Z def test_silu_mul_quant( 2025-05-07T20:32:08.7374923Z self, 2025-05-07T20:32:08.7375130Z T: int, 2025-05-07T20:32:08.7375350Z D: int, 2025-05-07T20:32:08.7375592Z scale_ub: Optional[float], 2025-05-07T20:32:08.7375886Z contiguous: bool, 2025-05-07T20:32:08.7376139Z compiled: bool, 2025-05-07T20:32:08.7376382Z ) -> None: 2025-05-07T20:32:08.7376617Z torch.manual_seed(2025) 2025-05-07T20:32:08.7376878Z 2025-05-07T20:32:08.7377258Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.7377634Z 2025-05-07T20:32:08.7377840Z x_sign = torch.sign(x) 2025-05-07T20:32:08.7378154Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.7378490Z x = x_sign * x_clamp 2025-05-07T20:32:08.7378744Z x0 = x[:, :D] 2025-05-07T20:32:08.7378982Z x1 = x[:, D:] 2025-05-07T20:32:08.7379211Z 2025-05-07T20:32:08.7379411Z if contiguous: 2025-05-07T20:32:08.7379663Z x0 = x0.contiguous() 2025-05-07T20:32:08.7379945Z x1 = x1.contiguous() 2025-05-07T20:32:08.7380201Z 2025-05-07T20:32:08.7380414Z if scale_ub is not None: 2025-05-07T20:32:08.7380713Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.7381068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.7381410Z ) 2025-05-07T20:32:08.7381711Z else: 2025-05-07T20:32:08.7381940Z scale_ub_tensor = None 2025-05-07T20:32:08.7382209Z 2025-05-07T20:32:08.7382458Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.7382794Z op = silu_mul_quant 2025-05-07T20:32:08.7383057Z if compiled: 2025-05-07T20:32:08.7383322Z op = torch.compile(op) 2025-05-07T20:32:08.7383640Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7383932Z 2025-05-07T20:32:08.7384141Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.7384313Z 2025-05-07T20:32:08.7384425Z moe/activation_test.py:117: 2025-05-07T20:32:08.7384736Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7385090Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.7385391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.7385980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.7386566Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.7387268Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.7387992Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.7388552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.7389268Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.7389969Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.7390535Z kernel = self.compile( 2025-05-07T20:32:08.7391111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.7391807Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.7392237Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.7392493Z 2025-05-07T20:32:08.7392712Z self = 2025-05-07T20:32:08.7393838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.7395268Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f62980>} 2025-05-07T20:32:08.7396666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.7397727Z context = 2025-05-07T20:32:08.7398042Z 2025-05-07T20:32:08.7398306Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.7398864Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.7399361Z module_map=module_map) 2025-05-07T20:32:08.7399744Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.7400202Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.7400482Z E ^ 2025-05-07T20:32:08.7400971Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.7401446Z 2025-05-07T20:32:08.7401880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8755663Z 2025-05-07T20:32:08.8756098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8756799Z self=, 2025-05-07T20:32:08.8757668Z T=4096, 2025-05-07T20:32:08.8758068Z D=5120, 2025-05-07T20:32:08.8758457Z scale_ub=1200.0, 2025-05-07T20:32:08.8758922Z contiguous=False, 2025-05-07T20:32:08.8759389Z compiled=False, 2025-05-07T20:32:08.8759808Z ) 2025-05-07T20:32:08.8760592Z self = 2025-05-07T20:32:08.8761633Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:08.8762205Z 2025-05-07T20:32:08.8762365Z @given( 2025-05-07T20:32:08.8762840Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.8763489Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.8764119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.8764842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.8765524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.8766121Z ) 2025-05-07T20:32:08.8766871Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.8767731Z def test_silu_mul_quant( 2025-05-07T20:32:08.8768028Z self, 2025-05-07T20:32:08.8768257Z T: int, 2025-05-07T20:32:08.8776755Z D: int, 2025-05-07T20:32:08.8777009Z scale_ub: Optional[float], 2025-05-07T20:32:08.8777309Z contiguous: bool, 2025-05-07T20:32:08.8777565Z compiled: bool, 2025-05-07T20:32:08.8777816Z ) -> None: 2025-05-07T20:32:08.8778057Z torch.manual_seed(2025) 2025-05-07T20:32:08.8778316Z 2025-05-07T20:32:08.8778617Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.8778988Z 2025-05-07T20:32:08.8779199Z x_sign = torch.sign(x) 2025-05-07T20:32:08.8779507Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.8779845Z x = x_sign * x_clamp 2025-05-07T20:32:08.8780103Z x0 = x[:, :D] 2025-05-07T20:32:08.8780342Z x1 = x[:, D:] 2025-05-07T20:32:08.8780566Z 2025-05-07T20:32:08.8780774Z if contiguous: 2025-05-07T20:32:08.8781019Z x0 = x0.contiguous() 2025-05-07T20:32:08.8781300Z x1 = x1.contiguous() 2025-05-07T20:32:08.8781558Z 2025-05-07T20:32:08.8781759Z if scale_ub is not None: 2025-05-07T20:32:08.8782056Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.8782420Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.8782744Z ) 2025-05-07T20:32:08.8782960Z else: 2025-05-07T20:32:08.8783187Z scale_ub_tensor = None 2025-05-07T20:32:08.8783452Z 2025-05-07T20:32:08.8783705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.8784040Z op = silu_mul_quant 2025-05-07T20:32:08.8784305Z if compiled: 2025-05-07T20:32:08.8784564Z op = torch.compile(op) 2025-05-07T20:32:08.8784882Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8785183Z 2025-05-07T20:32:08.8785382Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.8785779Z 2025-05-07T20:32:08.8785891Z moe/activation_test.py:117: 2025-05-07T20:32:08.8786207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8786555Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.8786855Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8787585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:08.8788307Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.8788867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.8789584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.8790287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.8790924Z kernel = self.compile( 2025-05-07T20:32:08.8791502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.8792190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.8792613Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8792852Z 2025-05-07T20:32:08.8793070Z self = 2025-05-07T20:32:08.8794202Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.8795658Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1f63ba0>} 2025-05-07T20:32:08.8797076Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.8798201Z context = 2025-05-07T20:32:08.8798503Z 2025-05-07T20:32:08.8798681Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.8799230Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.8799725Z module_map=module_map) 2025-05-07T20:32:08.8800171Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.8800552Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.8800831Z E ^ 2025-05-07T20:32:08.8801317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.8801802Z 2025-05-07T20:32:08.8802244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8802786Z 2025-05-07T20:32:08.8802896Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8803332Z self=, 2025-05-07T20:32:08.8803751Z T=4096, 2025-05-07T20:32:08.8803956Z D=5120, 2025-05-07T20:32:08.8804162Z scale_ub=1200.0, 2025-05-07T20:32:08.8804396Z contiguous=False, 2025-05-07T20:32:08.8804642Z compiled=True, 2025-05-07T20:32:08.8804860Z ) 2025-05-07T20:32:08.8805201Z self = 2025-05-07T20:32:08.8805716Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:08.8806010Z 2025-05-07T20:32:08.8806091Z @given( 2025-05-07T20:32:08.8806340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:08.8806669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:08.8807087Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:08.8807442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:08.8807784Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:08.8808088Z ) 2025-05-07T20:32:08.8808458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:08.8808922Z def test_silu_mul_quant( 2025-05-07T20:32:08.8809171Z self, 2025-05-07T20:32:08.8809381Z T: int, 2025-05-07T20:32:08.8809592Z D: int, 2025-05-07T20:32:08.8809818Z scale_ub: Optional[float], 2025-05-07T20:32:08.8810105Z contiguous: bool, 2025-05-07T20:32:08.8810358Z compiled: bool, 2025-05-07T20:32:08.8810588Z ) -> None: 2025-05-07T20:32:08.8810818Z torch.manual_seed(2025) 2025-05-07T20:32:08.8811074Z 2025-05-07T20:32:08.8811360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:08.8811846Z 2025-05-07T20:32:08.8812058Z x_sign = torch.sign(x) 2025-05-07T20:32:08.8812360Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:08.8812685Z x = x_sign * x_clamp 2025-05-07T20:32:08.8812938Z x0 = x[:, :D] 2025-05-07T20:32:08.8813165Z x1 = x[:, D:] 2025-05-07T20:32:08.8813716Z 2025-05-07T20:32:08.8813918Z if contiguous: 2025-05-07T20:32:08.8814156Z x0 = x0.contiguous() 2025-05-07T20:32:08.8814433Z x1 = x1.contiguous() 2025-05-07T20:32:08.8814688Z 2025-05-07T20:32:08.8814900Z if scale_ub is not None: 2025-05-07T20:32:08.8815188Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:08.8815546Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:08.8815872Z ) 2025-05-07T20:32:08.8816071Z else: 2025-05-07T20:32:08.8816298Z scale_ub_tensor = None 2025-05-07T20:32:08.8816571Z 2025-05-07T20:32:08.8816814Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:08.8817153Z op = silu_mul_quant 2025-05-07T20:32:08.8817422Z if compiled: 2025-05-07T20:32:08.8817680Z op = torch.compile(op) 2025-05-07T20:32:08.8817994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8818288Z 2025-05-07T20:32:08.8818486Z > y_fp8, y_scale = fn() 2025-05-07T20:32:08.8818669Z 2025-05-07T20:32:08.8818772Z moe/activation_test.py:117: 2025-05-07T20:32:08.8819087Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8819439Z moe/activation_test.py:115: in fn 2025-05-07T20:32:08.8819732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:08.8820318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:08.8820902Z return fn(*args, **kwargs) 
2025-05-07T20:32:08.8821586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:08.8822314Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:08.8822876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:08.8823586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:08.8824275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:08.8824837Z kernel = self.compile( 2025-05-07T20:32:08.8825405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:08.8826082Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:08.8826501Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:08.8826746Z 2025-05-07T20:32:08.8826970Z self = 2025-05-07T20:32:08.8828284Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:08.8829718Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2130ea0>} 2025-05-07T20:32:08.8831107Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:08.8832173Z context = 2025-05-07T20:32:08.8832480Z 2025-05-07T20:32:08.8832655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:08.8833334Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:08.8833817Z module_map=module_map) 2025-05-07T20:32:08.8834205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:08.8834582Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:08.8834849Z E ^ 2025-05-07T20:32:08.8835338Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:08.8835811Z 2025-05-07T20:32:08.8836244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:08.8836774Z 2025-05-07T20:32:08.8836889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:08.8837319Z self=, 2025-05-07T20:32:08.8837745Z T=2048, 2025-05-07T20:32:08.8837949Z D=7168, 2025-05-07T20:32:08.8838181Z scale_ub=1200.0, 2025-05-07T20:32:08.8838446Z contiguous=False, 2025-05-07T20:32:08.8838693Z compiled=False, 2025-05-07T20:32:09.0771533Z ) 2025-05-07T20:32:09.0772010Z self = 2025-05-07T20:32:09.0772816Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:09.0773219Z 2025-05-07T20:32:09.0773338Z @given( 2025-05-07T20:32:09.0773597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0773926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0774260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0774612Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0774956Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0775265Z ) 2025-05-07T20:32:09.0775637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0776127Z def test_silu_mul_quant( 2025-05-07T20:32:09.0776392Z self, 2025-05-07T20:32:09.0776610Z T: int, 2025-05-07T20:32:09.0776814Z D: int, 2025-05-07T20:32:09.0777047Z scale_ub: Optional[float], 2025-05-07T20:32:09.0777333Z contiguous: bool, 2025-05-07T20:32:09.0777591Z compiled: bool, 2025-05-07T20:32:09.0777839Z ) -> None: 2025-05-07T20:32:09.0778099Z torch.manual_seed(2025) 2025-05-07T20:32:09.0778355Z 2025-05-07T20:32:09.0778637Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0778998Z 2025-05-07T20:32:09.0779203Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0779506Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0779835Z x = x_sign * x_clamp 2025-05-07T20:32:09.0780092Z x0 = x[:, :D] 2025-05-07T20:32:09.0780315Z x1 = x[:, D:] 2025-05-07T20:32:09.0780537Z 2025-05-07T20:32:09.0780737Z if contiguous: 2025-05-07T20:32:09.0780982Z x0 = x0.contiguous() 2025-05-07T20:32:09.0781604Z x1 = x1.contiguous() 2025-05-07T20:32:09.0781864Z 2025-05-07T20:32:09.0782065Z if scale_ub is not None: 2025-05-07T20:32:09.0782355Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0782711Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0783039Z ) 2025-05-07T20:32:09.0783239Z else: 2025-05-07T20:32:09.0783464Z scale_ub_tensor = None 2025-05-07T20:32:09.0783730Z 2025-05-07T20:32:09.0783969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0784304Z op = silu_mul_quant 2025-05-07T20:32:09.0784570Z if compiled: 2025-05-07T20:32:09.0784827Z op = torch.compile(op) 2025-05-07T20:32:09.0785143Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0785442Z 2025-05-07T20:32:09.0785641Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0785823Z 2025-05-07T20:32:09.0786124Z moe/activation_test.py:117: 2025-05-07T20:32:09.0786448Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0786795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0787097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0787830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:09.0788556Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0789121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0789842Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0790544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0791112Z kernel = self.compile( 2025-05-07T20:32:09.0791681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0792385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0792809Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0793049Z 2025-05-07T20:32:09.0793272Z self = 2025-05-07T20:32:09.0794406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0795862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2131940>} 2025-05-07T20:32:09.0797266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0798341Z context = 2025-05-07T20:32:09.0798644Z 2025-05-07T20:32:09.0798824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0799378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0799871Z module_map=module_map) 2025-05-07T20:32:09.0800354Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0800725Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0800999Z E ^ 2025-05-07T20:32:09.0801485Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0801955Z 2025-05-07T20:32:09.0802390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0802935Z 2025-05-07T20:32:09.0803133Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0803578Z self=, 2025-05-07T20:32:09.0804007Z T=1, 2025-05-07T20:32:09.0804197Z D=7168, 2025-05-07T20:32:09.0804403Z scale_ub=None, 2025-05-07T20:32:09.0804628Z contiguous=True, 2025-05-07T20:32:09.0804860Z compiled=False, 2025-05-07T20:32:09.0805076Z ) 2025-05-07T20:32:09.0805412Z self = 2025-05-07T20:32:09.0805920Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:09.0806199Z 2025-05-07T20:32:09.0806281Z @given( 2025-05-07T20:32:09.0806526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0806853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0807177Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0807611Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0807967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0808269Z ) 2025-05-07T20:32:09.0808637Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0809104Z def test_silu_mul_quant( 2025-05-07T20:32:09.0809357Z self, 2025-05-07T20:32:09.0809562Z T: int, 2025-05-07T20:32:09.0809775Z D: int, 2025-05-07T20:32:09.0810001Z scale_ub: Optional[float], 2025-05-07T20:32:09.0810288Z contiguous: bool, 2025-05-07T20:32:09.0810547Z compiled: bool, 2025-05-07T20:32:09.0810782Z ) -> None: 2025-05-07T20:32:09.0811014Z torch.manual_seed(2025) 2025-05-07T20:32:09.0811275Z 2025-05-07T20:32:09.0811564Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0811928Z 2025-05-07T20:32:09.0812139Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0812452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0812787Z x = x_sign * x_clamp 2025-05-07T20:32:09.0813043Z x0 = x[:, :D] 2025-05-07T20:32:09.0813274Z x1 = x[:, D:] 2025-05-07T20:32:09.0813800Z 2025-05-07T20:32:09.0813995Z if contiguous: 2025-05-07T20:32:09.0814239Z x0 = x0.contiguous() 2025-05-07T20:32:09.0814506Z x1 = x1.contiguous() 2025-05-07T20:32:09.0814760Z 2025-05-07T20:32:09.0814961Z if scale_ub is not None: 2025-05-07T20:32:09.0815242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0815596Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0815921Z ) 2025-05-07T20:32:09.0816121Z else: 2025-05-07T20:32:09.0816345Z scale_ub_tensor = None 2025-05-07T20:32:09.0816610Z 2025-05-07T20:32:09.0816847Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0817179Z op = silu_mul_quant 2025-05-07T20:32:09.0817452Z if compiled: 2025-05-07T20:32:09.0817719Z op = torch.compile(op) 2025-05-07T20:32:09.0818086Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0818379Z 2025-05-07T20:32:09.0818583Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0818755Z 2025-05-07T20:32:09.0818859Z moe/activation_test.py:117: 2025-05-07T20:32:09.0819174Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0819527Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0819821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0820547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0821268Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0821838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0822560Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0823392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0823959Z kernel = self.compile( 2025-05-07T20:32:09.0824527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0825217Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0825641Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0825888Z 2025-05-07T20:32:09.0826112Z self = 2025-05-07T20:32:09.0827238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0828798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2132ca0>} 2025-05-07T20:32:09.0830202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0831274Z context = 2025-05-07T20:32:09.0831578Z 2025-05-07T20:32:09.0831762Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0832312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0832807Z module_map=module_map) 2025-05-07T20:32:09.0833191Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0833562Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0833845Z E ^ 2025-05-07T20:32:09.0834345Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0834817Z 2025-05-07T20:32:09.0835262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.0835800Z 2025-05-07T20:32:09.0835911Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.0836355Z self=, 2025-05-07T20:32:09.0836781Z T=16384, 2025-05-07T20:32:09.0836982Z D=7168, 2025-05-07T20:32:09.0837189Z scale_ub=1200.0, 2025-05-07T20:32:09.0837429Z contiguous=False, 2025-05-07T20:32:09.0837665Z compiled=True, 2025-05-07T20:32:09.0837881Z ) 2025-05-07T20:32:09.0838218Z self = 2025-05-07T20:32:09.0838748Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.0839052Z 2025-05-07T20:32:09.0839138Z @given( 2025-05-07T20:32:09.0839386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.0839722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.0840047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.0840476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.0840828Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.0841129Z ) 2025-05-07T20:32:09.0841501Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.0841968Z def test_silu_mul_quant( 2025-05-07T20:32:09.0842220Z self, 2025-05-07T20:32:09.0842428Z T: int, 2025-05-07T20:32:09.0842637Z D: int, 2025-05-07T20:32:09.0842869Z scale_ub: Optional[float], 2025-05-07T20:32:09.0843152Z contiguous: bool, 2025-05-07T20:32:09.0843406Z compiled: bool, 2025-05-07T20:32:09.0843648Z ) -> None: 2025-05-07T20:32:09.0843961Z torch.manual_seed(2025) 2025-05-07T20:32:09.0844219Z 2025-05-07T20:32:09.0844506Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.0844860Z 2025-05-07T20:32:09.0845067Z x_sign = torch.sign(x) 2025-05-07T20:32:09.0845381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.0845704Z x = x_sign * x_clamp 2025-05-07T20:32:09.0845962Z x0 = x[:, :D] 2025-05-07T20:32:09.0846199Z x1 = x[:, D:] 2025-05-07T20:32:09.0846414Z 2025-05-07T20:32:09.0846614Z if contiguous: 2025-05-07T20:32:09.0846860Z x0 = x0.contiguous() 2025-05-07T20:32:09.0847129Z x1 = x1.contiguous() 2025-05-07T20:32:09.0847389Z 2025-05-07T20:32:09.0847593Z if scale_ub is not None: 2025-05-07T20:32:09.0847887Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.0848238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.0848653Z ) 2025-05-07T20:32:09.0848871Z else: 2025-05-07T20:32:09.0849094Z scale_ub_tensor = None 2025-05-07T20:32:09.0849363Z 2025-05-07T20:32:09.0849615Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.0849942Z op = silu_mul_quant 2025-05-07T20:32:09.0850210Z if compiled: 2025-05-07T20:32:09.0850473Z op = torch.compile(op) 2025-05-07T20:32:09.0850779Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0851069Z 2025-05-07T20:32:09.0851273Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.0851445Z 2025-05-07T20:32:09.0851549Z moe/activation_test.py:117: 2025-05-07T20:32:09.0851861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0852210Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.0852510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.0853096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.0853688Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.0854380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.0855094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.0855657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.0856371Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.0857069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.0857621Z kernel = self.compile( 2025-05-07T20:32:09.0858258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.0858950Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.0859388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.0859633Z 2025-05-07T20:32:09.0859857Z self = 2025-05-07T20:32:09.0860979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.0862412Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2133f60>} 2025-05-07T20:32:09.0863815Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.0864900Z context = 2025-05-07T20:32:09.0874077Z 2025-05-07T20:32:09.0874300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.0874869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.0875375Z module_map=module_map) 2025-05-07T20:32:09.0875771Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.0876146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.0876426Z E ^ 2025-05-07T20:32:09.0876930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.0877405Z 2025-05-07T20:32:09.0877877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.2163862Z 2025-05-07T20:32:09.2164141Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.2164994Z self=, 2025-05-07T20:32:09.2165552Z T=1, 2025-05-07T20:32:09.2165745Z D=7168, 2025-05-07T20:32:09.2165958Z scale_ub=None, 2025-05-07T20:32:09.2166192Z contiguous=False, 2025-05-07T20:32:09.2166429Z compiled=False, 2025-05-07T20:32:09.2166657Z ) 2025-05-07T20:32:09.2166998Z self = 2025-05-07T20:32:09.2167505Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:09.2167785Z 2025-05-07T20:32:09.2167868Z @given( 2025-05-07T20:32:09.2168117Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.2168454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.2168776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.2169128Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.2169479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.2169785Z ) 2025-05-07T20:32:09.2170161Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.2170630Z def test_silu_mul_quant( 2025-05-07T20:32:09.2170881Z self, 2025-05-07T20:32:09.2171092Z T: int, 2025-05-07T20:32:09.2171306Z D: int, 2025-05-07T20:32:09.2171535Z scale_ub: Optional[float], 2025-05-07T20:32:09.2171826Z contiguous: bool, 2025-05-07T20:32:09.2172086Z compiled: bool, 2025-05-07T20:32:09.2172326Z ) -> None: 2025-05-07T20:32:09.2172562Z torch.manual_seed(2025) 2025-05-07T20:32:09.2172825Z 2025-05-07T20:32:09.2173111Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.2173477Z 2025-05-07T20:32:09.2173690Z x_sign = torch.sign(x) 2025-05-07T20:32:09.2174004Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.2174328Z x = x_sign * x_clamp 2025-05-07T20:32:09.2174598Z x0 = x[:, :D] 2025-05-07T20:32:09.2174836Z x1 = x[:, D:] 2025-05-07T20:32:09.2175057Z 2025-05-07T20:32:09.2175265Z if contiguous: 2025-05-07T20:32:09.2175510Z x0 = x0.contiguous() 2025-05-07T20:32:09.2175780Z x1 = x1.contiguous() 2025-05-07T20:32:09.2176044Z 2025-05-07T20:32:09.2176258Z if scale_ub is not None: 2025-05-07T20:32:09.2176540Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.2176902Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.2177234Z ) 2025-05-07T20:32:09.2177468Z else: 2025-05-07T20:32:09.2177701Z scale_ub_tensor = None 2025-05-07T20:32:09.2177964Z 2025-05-07T20:32:09.2178214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.2178547Z op = silu_mul_quant 2025-05-07T20:32:09.2178809Z if compiled: 2025-05-07T20:32:09.2179073Z op = torch.compile(op) 2025-05-07T20:32:09.2179398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.2179846Z 2025-05-07T20:32:09.2180058Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.2180239Z 2025-05-07T20:32:09.2180346Z moe/activation_test.py:117: 2025-05-07T20:32:09.2180662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.2181014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.2181314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.2182037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.2182751Z 
[The identical CompilationError repeats for every further example Hypothesis draws; the duplicated test source and tracebacks are elided below, one line per attempt. For compiled=True the call chain additionally passes through torch/_dynamo/eval_frame.py:678 before reaching the same kernel launch.]
2025-05-07T20:32:09.2197243Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.2232584Z Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.4486884Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
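For orientation, the op under test consumes the two D-wide halves x0 and x1 and returns a quantized tensor plus scales. A plausible eager-mode stand-in, a sketch only (activation.py itself is not shown in this log, and the rowwise scaling granularity is an assumption):

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, then rowwise symmetric quantization to e4m3.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / torch.finfo(torch.float8_e4m3fn).max  # e4m3 max is 448.0
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The contiguous sweep in the test exists because the kernel must handle the strided views x[:, :D] and x[:, D:] as well as compacted copies of them.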
2025-05-07T20:32:09.4520055Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
2025-05-07T20:32:09.6074406Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.6110509Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
2025-05-07T20:32:09.7716594Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
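Since the failure is a property of the device rather than of any drawn example, Hypothesis keeps retrying and hitting the identical error. One illustrative fix, not the actual FBGEMM test code (the class name and capability threshold are assumptions), is to gate the whole case on device capability:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) lowering is assumed to require an
        # NVIDIA GPU with compute capability 8.9 or newer.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class SiluMulQuantTests(unittest.TestCase):
        ...

With a gate like this, the job would report skips here instead of burning through every Hypothesis example. The remaining attempts from this run follow.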
2025-05-07T20:32:09.7735821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.7736548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.7737112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7737836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7738541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7739107Z kernel = self.compile( 2025-05-07T20:32:09.7739676Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7740369Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7740793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7741042Z 2025-05-07T20:32:09.7741265Z self = 2025-05-07T20:32:09.7742406Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7743871Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1534c20>} 2025-05-07T20:32:09.7745288Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7746366Z context = 2025-05-07T20:32:09.7746674Z 2025-05-07T20:32:09.7746849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7747485Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7747985Z module_map=module_map) 2025-05-07T20:32:09.7748370Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7748739Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.7749015Z E ^ 2025-05-07T20:32:09.7749511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7749985Z 2025-05-07T20:32:09.7750426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.7750969Z 2025-05-07T20:32:09.7751078Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.7751514Z self=, 2025-05-07T20:32:09.7752055Z T=4096, 2025-05-07T20:32:09.7752249Z D=5120, 2025-05-07T20:32:09.7752457Z scale_ub=1200.0, 2025-05-07T20:32:09.7752693Z contiguous=True, 2025-05-07T20:32:09.7752919Z compiled=True, 2025-05-07T20:32:09.7753131Z ) 2025-05-07T20:32:09.7753467Z self = 2025-05-07T20:32:09.7753982Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.7754274Z 2025-05-07T20:32:09.7754357Z @given( 2025-05-07T20:32:09.7754602Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.7754927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.7755251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.7755601Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.7755947Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.7756242Z ) 2025-05-07T20:32:09.7756611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.7757088Z def test_silu_mul_quant( 2025-05-07T20:32:09.7757337Z self, 2025-05-07T20:32:09.7757541Z T: int, 2025-05-07T20:32:09.7757756Z D: int, 2025-05-07T20:32:09.7758002Z scale_ub: Optional[float], 2025-05-07T20:32:09.7758319Z contiguous: bool, 2025-05-07T20:32:09.7758575Z compiled: bool, 2025-05-07T20:32:09.7758802Z ) -> None: 2025-05-07T20:32:09.7759031Z torch.manual_seed(2025) 2025-05-07T20:32:09.7759287Z 2025-05-07T20:32:09.7759569Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.7759933Z 2025-05-07T20:32:09.7760223Z x_sign = torch.sign(x) 2025-05-07T20:32:09.7760533Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.7760863Z x = x_sign * x_clamp 2025-05-07T20:32:09.7761117Z x0 = x[:, :D] 2025-05-07T20:32:09.7761348Z x1 = x[:, D:] 2025-05-07T20:32:09.7761568Z 2025-05-07T20:32:09.7761763Z if contiguous: 2025-05-07T20:32:09.7762008Z x0 = x0.contiguous() 2025-05-07T20:32:09.7762277Z x1 = x1.contiguous() 2025-05-07T20:32:09.7762529Z 2025-05-07T20:32:09.7762728Z if scale_ub is not None: 2025-05-07T20:32:09.7763009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.7763363Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.7763686Z ) 2025-05-07T20:32:09.7763884Z else: 2025-05-07T20:32:09.7764104Z scale_ub_tensor = None 2025-05-07T20:32:09.7764370Z 2025-05-07T20:32:09.7764607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.7764938Z op = silu_mul_quant 2025-05-07T20:32:09.7765205Z if compiled: 2025-05-07T20:32:09.7765466Z op = torch.compile(op) 2025-05-07T20:32:09.7765776Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7766068Z 2025-05-07T20:32:09.7766291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.7766468Z 2025-05-07T20:32:09.7766670Z moe/activation_test.py:117: 2025-05-07T20:32:09.7766978Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7767336Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.7767634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.7768219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.7768815Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.7769514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.7770240Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.7779412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.7780196Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.7781034Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.7781605Z kernel = self.compile( 2025-05-07T20:32:09.7782179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.7782875Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.7783302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.7783546Z 2025-05-07T20:32:09.7783765Z self = 2025-05-07T20:32:09.7784900Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.7786356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1535a80>} 2025-05-07T20:32:09.7787772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.7788847Z context = 2025-05-07T20:32:09.7789154Z 2025-05-07T20:32:09.7789329Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.7789882Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.7790377Z module_map=module_map) 2025-05-07T20:32:09.7790765Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.7791138Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.7791419Z E ^ 2025-05-07T20:32:09.7791923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.7792394Z 2025-05-07T20:32:09.7792838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.9453402Z 2025-05-07T20:32:09.9453770Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.9454268Z self=, 2025-05-07T20:32:09.9454692Z T=128, 2025-05-07T20:32:09.9454898Z D=5120, 2025-05-07T20:32:09.9455111Z scale_ub=1200.0, 2025-05-07T20:32:09.9455347Z contiguous=False, 2025-05-07T20:32:09.9455597Z compiled=True, 2025-05-07T20:32:09.9455820Z ) 2025-05-07T20:32:09.9456159Z self = 2025-05-07T20:32:09.9456682Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:09.9456998Z 2025-05-07T20:32:09.9457081Z @given( 2025-05-07T20:32:09.9457682Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.9458013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.9458345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.9458701Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.9459042Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.9459350Z ) 2025-05-07T20:32:09.9459725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.9460196Z def test_silu_mul_quant( 2025-05-07T20:32:09.9460448Z self, 2025-05-07T20:32:09.9460661Z T: int, 2025-05-07T20:32:09.9460873Z D: int, 2025-05-07T20:32:09.9461100Z scale_ub: Optional[float], 2025-05-07T20:32:09.9461389Z contiguous: bool, 2025-05-07T20:32:09.9461654Z compiled: bool, 2025-05-07T20:32:09.9461893Z ) -> None: 2025-05-07T20:32:09.9462292Z torch.manual_seed(2025) 2025-05-07T20:32:09.9462555Z 2025-05-07T20:32:09.9462846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.9463209Z 2025-05-07T20:32:09.9463423Z x_sign = torch.sign(x) 2025-05-07T20:32:09.9463727Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.9464062Z x = x_sign * x_clamp 2025-05-07T20:32:09.9464318Z x0 = x[:, :D] 2025-05-07T20:32:09.9464545Z x1 = x[:, D:] 2025-05-07T20:32:09.9464772Z 2025-05-07T20:32:09.9464972Z if contiguous: 2025-05-07T20:32:09.9465211Z x0 = x0.contiguous() 2025-05-07T20:32:09.9465487Z x1 = x1.contiguous() 2025-05-07T20:32:09.9465745Z 2025-05-07T20:32:09.9465946Z if scale_ub is not None: 2025-05-07T20:32:09.9466240Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.9466600Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.9466938Z ) 2025-05-07T20:32:09.9467140Z else: 2025-05-07T20:32:09.9467378Z scale_ub_tensor = None 2025-05-07T20:32:09.9467646Z 2025-05-07T20:32:09.9467888Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.9468227Z op = silu_mul_quant 2025-05-07T20:32:09.9468492Z if compiled: 2025-05-07T20:32:09.9468751Z op = torch.compile(op) 2025-05-07T20:32:09.9469068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9469363Z 2025-05-07T20:32:09.9469566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.9469747Z 2025-05-07T20:32:09.9469855Z moe/activation_test.py:117: 2025-05-07T20:32:09.9470166Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9470543Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.9470851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9471438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.9472034Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.9472729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.9473441Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.9474005Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.9474715Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.9475408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.9475964Z kernel = self.compile( 2025-05-07T20:32:09.9476530Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.9477215Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9477724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9477977Z 2025-05-07T20:32:09.9478229Z self = 2025-05-07T20:32:09.9479372Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.9481053Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1536ca0>} 2025-05-07T20:32:09.9482462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.9483532Z context = 2025-05-07T20:32:09.9483928Z 2025-05-07T20:32:09.9484114Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.9484674Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9485177Z module_map=module_map) 2025-05-07T20:32:09.9485566Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9485950Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9486235Z E ^ 2025-05-07T20:32:09.9486732Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:09.9487207Z 2025-05-07T20:32:09.9487647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:09.9488198Z 2025-05-07T20:32:09.9488311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:09.9488763Z self=, 2025-05-07T20:32:09.9489200Z T=16384, 2025-05-07T20:32:09.9489408Z D=7168, 2025-05-07T20:32:09.9489623Z scale_ub=1200.0, 2025-05-07T20:32:09.9489861Z contiguous=True, 2025-05-07T20:32:09.9490093Z compiled=True, 2025-05-07T20:32:09.9490313Z ) 2025-05-07T20:32:09.9490657Z self = 2025-05-07T20:32:09.9491180Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:09.9491484Z 2025-05-07T20:32:09.9491568Z @given( 2025-05-07T20:32:09.9491821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:09.9492157Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:09.9492491Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:09.9492846Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:09.9493201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:09.9493512Z ) 2025-05-07T20:32:09.9493892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:09.9494368Z def test_silu_mul_quant( 2025-05-07T20:32:09.9494623Z self, 2025-05-07T20:32:09.9494836Z T: int, 2025-05-07T20:32:09.9495050Z D: int, 2025-05-07T20:32:09.9495281Z scale_ub: Optional[float], 2025-05-07T20:32:09.9495572Z contiguous: bool, 2025-05-07T20:32:09.9495829Z compiled: bool, 2025-05-07T20:32:09.9496060Z ) -> None: 2025-05-07T20:32:09.9496288Z torch.manual_seed(2025) 2025-05-07T20:32:09.9496547Z 2025-05-07T20:32:09.9496833Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:09.9497193Z 2025-05-07T20:32:09.9497401Z x_sign = torch.sign(x) 2025-05-07T20:32:09.9497705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:09.9498038Z x = x_sign * x_clamp 2025-05-07T20:32:09.9498294Z x0 = x[:, :D] 2025-05-07T20:32:09.9498532Z x1 = x[:, D:] 2025-05-07T20:32:09.9498750Z 2025-05-07T20:32:09.9499047Z if contiguous: 2025-05-07T20:32:09.9499297Z x0 = x0.contiguous() 2025-05-07T20:32:09.9499569Z x1 = x1.contiguous() 2025-05-07T20:32:09.9499827Z 2025-05-07T20:32:09.9500035Z if scale_ub is not None: 2025-05-07T20:32:09.9500325Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:09.9500685Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:09.9501015Z ) 2025-05-07T20:32:09.9501222Z else: 2025-05-07T20:32:09.9501451Z scale_ub_tensor = None 2025-05-07T20:32:09.9501722Z 2025-05-07T20:32:09.9501966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:09.9502305Z op = silu_mul_quant 2025-05-07T20:32:09.9502576Z if compiled: 2025-05-07T20:32:09.9502835Z op = torch.compile(op) 2025-05-07T20:32:09.9503155Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9503572Z 2025-05-07T20:32:09.9503782Z > y_fp8, y_scale = fn() 2025-05-07T20:32:09.9503959Z 2025-05-07T20:32:09.9504063Z moe/activation_test.py:117: 2025-05-07T20:32:09.9504378Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9504731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:09.9505031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:09.9505619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:09.9506202Z return fn(*args, **kwargs) 
2025-05-07T20:32:09.9506889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:09.9507608Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:09.9508179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:09.9508902Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:09.9509603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:09.9510166Z kernel = self.compile( 2025-05-07T20:32:09.9510730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:09.9511423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:09.9511842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:09.9512086Z 2025-05-07T20:32:09.9512308Z self = 2025-05-07T20:32:09.9513700Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:09.9515146Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef14c8400>} 2025-05-07T20:32:09.9516551Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:09.9517620Z context = 2025-05-07T20:32:09.9517925Z 2025-05-07T20:32:09.9518115Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:09.9518663Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:09.9519171Z module_map=module_map) 2025-05-07T20:32:09.9519563Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:09.9519938Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:09.9520283Z E ^ 2025-05-07T20:32:09.9520930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:09.9521854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:10.0664643Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.0696005Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0697561Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:10.0728527Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.0730231Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:10.2366642Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
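Every example above dies in Triton's front end before the kernel body is ever evaluated, so the test source quoted in the traceback is the only description of the op we get. For orientation, here is a minimal eager sketch of what `silu_mul_quant` appears to compute, inferred purely from the call shape in the test (`op(x0, x1, scale_ub_tensor)` returning `(y_fp8, y_scale)`); the row-wise scaling scheme and the `float8_e4m3fn` target are assumptions, not FBGEMM's actual kernel:

```python
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Eager sketch (assumed semantics): fp8-quantize silu(x0) * x1 row-wise."""
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    # One scale per row, optionally capped by scale_ub as the test's
    # scale_ub_tensor argument suggests.
    row_max = y.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / FP8_MAX
    y_fp8 = torch.clamp(y / y_scale, -FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, y_scale.squeeze(1)
```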
2025-05-07T20:32:10.2368220Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:10.2400580Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.2402144Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:10.3700320Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
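The `fp8e4nv` in the ValueError is Triton's name for the e4m3 float8 format. This job runs on `linux.g5.4xlarge.nvidia.gpu` (NVIDIA A10G, compute capability 8.6), and Triton generally accepts `fp8e4nv` only on sm_89 (Ada) and newer, hence the compile-time rejection on every parameter combination. A guard along the following lines (hypothetical, not part of `moe/activation_test.py`) would skip rather than fail these cases on older GPUs:

```python
# Hypothetical guard, not from the FBGEMM suite: skip fp8 tests on GPUs that
# predate fp8e4nv (e4m3) support, which Triton ties to sm_89 and newer.
import unittest

import torch


def supports_fp8e4nv() -> bool:
    """True if the current CUDA device should compile Triton fp8e4nv kernels."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)  # Ada (sm_89) and Hopper (sm_90) onward


@unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class Fp8ActivationTests(unittest.TestCase):
    def test_placeholder(self) -> None:
        # Placeholder body; the real fp8 tests would live here.
        self.assertTrue(supports_fp8e4nv())
```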
2025-05-07T20:32:10.3701873Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:10.3711245Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3713607Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3715687Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.3716036Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:10.3725533Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.3727607Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3729673Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.3730004Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:10.3738400Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.3740677Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 136.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.3742739Z moe/activation_test.py:92: OutOfMemoryError
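From this point the failures alternate between the Triton compile error and CUDA OOM: each failing example allocates fresh `[T, 2 * D]` bfloat16 tensors, and after enough examples the 22 GiB A10G is exhausted, so even a 56 MiB request fails. The allocator's own hint from the message, plus an explicit cleanup between examples, would look roughly like this (a sketch; the environment variable must be set before the first CUDA allocation, and the per-example flush is an addition, not something the test currently does):

```python
# Sketch of the allocator's suggestion from the log, plus explicit cleanup
# between Hypothesis examples. Assumption: this module is imported before
# torch touches the GPU, since PYTORCH_CUDA_ALLOC_CONF is read at the first
# CUDA allocation.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc

import torch


def release_cuda_memory() -> None:
    """Drop dead references, then return cached blocks to the driver."""
    gc.collect()
    torch.cuda.empty_cache()
```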
2025-05-07T20:32:10.4964575Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:10.4974666Z >           x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:10.4976813Z E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.4978944Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:10.4979276Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.4998034Z >       x_sign = torch.sign(x)
2025-05-07T20:32:10.5000058Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 32.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 80.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:10.5002257Z moe/activation_test.py:94: OutOfMemoryError
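The remaining examples below reproduce the same `fp8e4nv` CompilationError for other parameter combinations. To replay one specific combination deterministically, rather than waiting for Hypothesis's search to hit it again, `@example` pins it as an always-run case; a standalone sketch (hypothetical, not from the FBGEMM suite):

```python
# Standalone sketch of pinning a failing parameter set with Hypothesis's
# @example decorator so it is always replayed as a regression case.
from hypothesis import example, given, settings, strategies as st


@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
@example(T=16384, D=7168)  # the first combination seen to fail in this log
@settings(max_examples=5, deadline=None)
def test_shapes_are_positive(T: int, D: int) -> None:
    # Stand-in assertion; the real test body would call the op under test.
    assert T > 0 and D > 0


test_shapes_are_positive()
```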
2025-05-07T20:32:10.5002610Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.5033823Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.5035384Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.5066281Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.6176816Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:10.6209418Z moe/activation_test.py:117: triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:10.6211065Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:10.6219668Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:10.6221812Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:10.6223739Z 2025-05-07T20:32:10.6223865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:10.6224093Z 2025-05-07T20:32:10.6224203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.6224636Z self=, 2025-05-07T20:32:10.6225048Z T=1, 2025-05-07T20:32:10.6225246Z D=5120, 2025-05-07T20:32:10.6225448Z scale_ub=1200.0, 2025-05-07T20:32:10.6225677Z contiguous=True, 2025-05-07T20:32:10.6225910Z compiled=False, 2025-05-07T20:32:10.6226137Z ) 2025-05-07T20:32:10.6226597Z self = 2025-05-07T20:32:10.6227113Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:10.6227389Z 2025-05-07T20:32:10.6227475Z @given( 2025-05-07T20:32:10.6227714Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.6228043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.6228372Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.6228721Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.6229065Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.6229372Z ) 2025-05-07T20:32:10.6229747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.6230208Z def test_silu_mul_quant( 2025-05-07T20:32:10.6230469Z self, 2025-05-07T20:32:10.6230677Z T: int, 2025-05-07T20:32:10.6230879Z D: int, 2025-05-07T20:32:10.6231227Z scale_ub: Optional[float], 2025-05-07T20:32:10.6231523Z contiguous: bool, 2025-05-07T20:32:10.6231772Z compiled: bool, 2025-05-07T20:32:10.6232010Z ) -> None: 2025-05-07T20:32:10.6232240Z torch.manual_seed(2025) 2025-05-07T20:32:10.6232490Z 2025-05-07T20:32:10.6232779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.6233141Z 2025-05-07T20:32:10.6233343Z x_sign = torch.sign(x) 2025-05-07T20:32:10.6233659Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.6233988Z x = x_sign * x_clamp 2025-05-07T20:32:10.6234243Z x0 = x[:, :D] 2025-05-07T20:32:10.6234470Z x1 = x[:, D:] 2025-05-07T20:32:10.6234692Z 2025-05-07T20:32:10.6234888Z if contiguous: 2025-05-07T20:32:10.6235129Z x0 = x0.contiguous() 2025-05-07T20:32:10.6235405Z x1 = x1.contiguous() 2025-05-07T20:32:10.6235662Z 2025-05-07T20:32:10.6235867Z if scale_ub is not None: 2025-05-07T20:32:10.6236157Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.6236523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.6236853Z ) 2025-05-07T20:32:10.6237055Z else: 2025-05-07T20:32:10.6237283Z scale_ub_tensor = None 2025-05-07T20:32:10.6237550Z 2025-05-07T20:32:10.6237794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.6238131Z op = silu_mul_quant 2025-05-07T20:32:10.6238397Z if compiled: 2025-05-07T20:32:10.6238656Z op = torch.compile(op) 2025-05-07T20:32:10.6238976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6239270Z 2025-05-07T20:32:10.6239470Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.6239648Z 2025-05-07T20:32:10.6239752Z moe/activation_test.py:117: 2025-05-07T20:32:10.6240064Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.6240514Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.6240812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.6241534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:10.6242256Z 
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above: triton/runtime/jit.py:330 -> jit.py:623 in run -> triton/compiler/compiler.py:273 in compile -> src.make_ir(options, codegen_fns, module_map, context)]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.6255734Z 2025-05-07T20:32:10.6256177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.7078824Z 2025-05-07T20:32:10.7079239Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.7079851Z self=, 2025-05-07T20:32:10.7080604Z T=2048, 2025-05-07T20:32:10.7080877Z D=5120, 2025-05-07T20:32:10.7081161Z scale_ub=None, 2025-05-07T20:32:10.7081383Z contiguous=True, 2025-05-07T20:32:10.7081624Z compiled=False, 2025-05-07T20:32:10.7081845Z ) 2025-05-07T20:32:10.7082171Z self = 2025-05-07T20:32:10.7082686Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:10.7082966Z 2025-05-07T20:32:10.7083056Z @given( 2025-05-07T20:32:10.7083299Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.7083618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.7083939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.7084286Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.7084626Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.7084932Z ) 2025-05-07T20:32:10.7085305Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.7085764Z def test_silu_mul_quant( 2025-05-07T20:32:10.7086057Z self, 2025-05-07T20:32:10.7086258Z T: int, 2025-05-07T20:32:10.7086465Z D: int, 2025-05-07T20:32:10.7086697Z scale_ub: Optional[float], 2025-05-07T20:32:10.7086977Z contiguous: bool, 2025-05-07T20:32:10.7087231Z compiled: bool, 2025-05-07T20:32:10.7087472Z ) -> None: 2025-05-07T20:32:10.7087694Z torch.manual_seed(2025) 2025-05-07T20:32:10.7087950Z 2025-05-07T20:32:10.7088243Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.7088643Z 2025-05-07T20:32:10.7088850Z > x_sign = torch.sign(x) 2025-05-07T20:32:10.7091170Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
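The requested sizes track the bfloat16 tensors being materialized exactly: a [T, 2*D] bf16 tensor, or any elementwise result of the same shape such as torch.sign(x) above, occupies T * 2D * 2 bytes. A quick cross-check of the figures in this log (the helper is illustrative only):

def bf16_mib(T: int, D: int) -> float:
    # Size in MiB of a [T, 2*D] bfloat16 tensor (2 bytes per element).
    return T * (2 * D) * 2 / 2**20

assert bf16_mib(2048, 7168) == 56.0    # the randn() OOM further up
assert bf16_mib(2048, 5120) == 40.0    # this torch.sign(x) OOM: the result is x-sized
assert bf16_mib(16384, 7168) == 448.0  # the largest request later in this run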
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:94: OutOfMemoryError

The next ten examples all fail the same way: the first allocation, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) (moe/activation_test.py:92), raises torch.OutOfMemoryError on GPU 0 (22.07 GiB total, 30.44 MiB free, 22.03 GiB in use, 21.73 GiB allocated by PyTorch, 13.87 MiB reserved but unallocated), each time with the same hint to try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True:

Trying example: T=16384, D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 320.00 MiB
Trying example: T=4096,  D=5120, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 80.00 MiB
Trying example: T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=False  -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=True   -> tried to allocate 112.00 MiB
Trying example: T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False  -> tried to allocate 40.00 MiB
Trying example: T=4096,  D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=False, compiled=True   -> tried to allocate 448.00 MiB
Trying example: T=4096,  D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 112.00 MiB
Trying example: T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=False  -> tried to allocate 448.00 MiB
Trying example: T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=False  -> tried to allocate 448.00 MiB

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

This example fits in memory and runs the test body shown above through to the kernel launch:

> y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9605925Z 2025-05-07T20:32:10.9606443Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:10.9606977Z 2025-05-07T20:32:10.9607095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9607531Z self=, 2025-05-07T20:32:10.9607947Z T=2048, 2025-05-07T20:32:10.9608145Z D=7168, 2025-05-07T20:32:10.9608350Z scale_ub=None, 2025-05-07T20:32:10.9608572Z contiguous=False, 2025-05-07T20:32:10.9608812Z compiled=False, 2025-05-07T20:32:10.9609031Z ) 2025-05-07T20:32:10.9609361Z self = 2025-05-07T20:32:10.9609880Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:10.9610164Z 2025-05-07T20:32:10.9610252Z @given( 2025-05-07T20:32:10.9610488Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9610909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9611232Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9611579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9611920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9612222Z ) 2025-05-07T20:32:10.9612589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9613047Z def test_silu_mul_quant( 2025-05-07T20:32:10.9613754Z self, 2025-05-07T20:32:10.9614136Z T: int, 2025-05-07T20:32:10.9614419Z D: int, 2025-05-07T20:32:10.9614744Z scale_ub: Optional[float], 2025-05-07T20:32:10.9615149Z contiguous: bool, 2025-05-07T20:32:10.9615493Z compiled: bool, 2025-05-07T20:32:10.9615823Z ) -> None: 2025-05-07T20:32:10.9616138Z torch.manual_seed(2025) 2025-05-07T20:32:10.9616497Z 2025-05-07T20:32:10.9616906Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9619485Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
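Every one of these messages repeats the same hint: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. For it to take effect, the variable must be set before the process makes its first CUDA allocation; a sketch of doing so at interpreter startup (where to put it, e.g. a conftest.py or the CI job environment, is an assumption, not something this workflow currently does):

import os

# Must be set before torch initializes CUDA for the allocator to pick it up.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

_ = torch.empty(1, device="cuda")  # first CUDA allocation now uses expandable segments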
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:10.9621413Z 2025-05-07T20:32:10.9621540Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:10.9621768Z 2025-05-07T20:32:10.9621875Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:10.9622310Z self=, 2025-05-07T20:32:10.9622729Z T=128, 2025-05-07T20:32:10.9622925Z D=7168, 2025-05-07T20:32:10.9623131Z scale_ub=1200.0, 2025-05-07T20:32:10.9623360Z contiguous=True, 2025-05-07T20:32:10.9623591Z compiled=True, 2025-05-07T20:32:10.9623804Z ) 2025-05-07T20:32:10.9624131Z self = 2025-05-07T20:32:10.9624645Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:10.9624922Z 2025-05-07T20:32:10.9625013Z @given( 2025-05-07T20:32:10.9625258Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:10.9625580Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:10.9625901Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:10.9626245Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:10.9626584Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:10.9626885Z ) 2025-05-07T20:32:10.9627253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:10.9627715Z def test_silu_mul_quant( 2025-05-07T20:32:10.9628127Z self, 2025-05-07T20:32:10.9628336Z T: int, 2025-05-07T20:32:10.9628536Z D: int, 2025-05-07T20:32:10.9628767Z scale_ub: Optional[float], 2025-05-07T20:32:10.9629056Z contiguous: bool, 2025-05-07T20:32:10.9629302Z compiled: bool, 2025-05-07T20:32:10.9629538Z ) -> None: 2025-05-07T20:32:10.9629763Z torch.manual_seed(2025) 2025-05-07T20:32:10.9630009Z 2025-05-07T20:32:10.9630299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:10.9630656Z 2025-05-07T20:32:10.9630869Z x_sign = torch.sign(x) 2025-05-07T20:32:10.9631169Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:10.9631495Z x = x_sign * x_clamp 2025-05-07T20:32:10.9631747Z x0 = x[:, :D] 2025-05-07T20:32:10.9631970Z x1 = x[:, D:] 2025-05-07T20:32:10.9632189Z 2025-05-07T20:32:10.9632509Z if contiguous: 2025-05-07T20:32:10.9632745Z x0 = x0.contiguous() 2025-05-07T20:32:10.9633023Z x1 = x1.contiguous() 2025-05-07T20:32:10.9633275Z 2025-05-07T20:32:10.9633469Z if scale_ub is not None: 2025-05-07T20:32:10.9633756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:10.9634109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:10.9634425Z ) 2025-05-07T20:32:10.9634628Z else: 2025-05-07T20:32:10.9634846Z scale_ub_tensor = None 2025-05-07T20:32:10.9635104Z 2025-05-07T20:32:10.9635348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:10.9635678Z op = silu_mul_quant 2025-05-07T20:32:10.9635940Z if compiled: 2025-05-07T20:32:10.9636194Z op = torch.compile(op) 2025-05-07T20:32:10.9636507Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9636798Z 2025-05-07T20:32:10.9636996Z > y_fp8, y_scale = fn() 2025-05-07T20:32:10.9637184Z 2025-05-07T20:32:10.9637291Z moe/activation_test.py:117: 2025-05-07T20:32:10.9637604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:10.9637950Z moe/activation_test.py:115: in fn 2025-05-07T20:32:10.9638247Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:10.9638831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:10.9639416Z return fn(*args, **kwargs) 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
_fbgemm_silu_mul_quant[grid](
[Triton compilation traceback identical to the first occurrence above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:10.9654161Z 2025-05-07T20:32:10.9654590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.5609964Z 2025-05-07T20:32:11.5610465Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5610957Z self=, 2025-05-07T20:32:11.5611377Z T=128, 2025-05-07T20:32:11.5611574Z D=7168, 2025-05-07T20:32:11.5611775Z scale_ub=1200.0, 2025-05-07T20:32:11.5612002Z contiguous=True, 2025-05-07T20:32:11.5612236Z compiled=False, 2025-05-07T20:32:11.5612455Z ) 2025-05-07T20:32:11.5612785Z self = 2025-05-07T20:32:11.5613572Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.5613991Z 2025-05-07T20:32:11.5614105Z @given( 2025-05-07T20:32:11.5614355Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5614674Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5614996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5615338Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5615674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5615972Z ) 2025-05-07T20:32:11.5616338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5616792Z def test_silu_mul_quant( 2025-05-07T20:32:11.5617045Z self, 2025-05-07T20:32:11.5617245Z T: int, 2025-05-07T20:32:11.5617443Z D: int, 2025-05-07T20:32:11.5617673Z scale_ub: Optional[float], 2025-05-07T20:32:11.5617959Z contiguous: bool, 2025-05-07T20:32:11.5618210Z compiled: bool, 2025-05-07T20:32:11.5618452Z ) -> None: 2025-05-07T20:32:11.5618718Z torch.manual_seed(2025) 2025-05-07T20:32:11.5618978Z 2025-05-07T20:32:11.5619259Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5619614Z 2025-05-07T20:32:11.5619818Z x_sign = torch.sign(x) 2025-05-07T20:32:11.5620118Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.5622204Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
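Note that the free-memory figure has dropped from 30.44 MiB in the earlier examples to 8.44 MiB here: allocations made by previous Hypothesis examples are still live when the next example starts. A hypothetical per-example cleanup (activation_test.py does not currently do this) that would return cached blocks between examples:

import gc
import unittest

import torch

class ActivationTests(unittest.TestCase):
    def tearDown(self) -> None:
        # Drop references left over from the previous example, then release
        # PyTorch's cached CUDA blocks so the next example starts clean.
        gc.collect()
        torch.cuda.empty_cache()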
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5624150Z 2025-05-07T20:32:11.5624272Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:11.5624500Z 2025-05-07T20:32:11.5624883Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5625321Z self=, 2025-05-07T20:32:11.5625730Z T=128, 2025-05-07T20:32:11.5625925Z D=5120, 2025-05-07T20:32:11.5626126Z scale_ub=1200.0, 2025-05-07T20:32:11.5626349Z contiguous=True, 2025-05-07T20:32:11.5626575Z compiled=True, 2025-05-07T20:32:11.5626786Z ) 2025-05-07T20:32:11.5627129Z self = 2025-05-07T20:32:11.5627632Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.5627914Z 2025-05-07T20:32:11.5627993Z @given( 2025-05-07T20:32:11.5628229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5628552Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5628864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5629372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5629724Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5630015Z ) 2025-05-07T20:32:11.5630375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5630831Z def test_silu_mul_quant( 2025-05-07T20:32:11.5631074Z self, 2025-05-07T20:32:11.5631276Z T: int, 2025-05-07T20:32:11.5631482Z D: int, 2025-05-07T20:32:11.5631703Z scale_ub: Optional[float], 2025-05-07T20:32:11.5631985Z contiguous: bool, 2025-05-07T20:32:11.5632236Z compiled: bool, 2025-05-07T20:32:11.5632460Z ) -> None: 2025-05-07T20:32:11.5632687Z torch.manual_seed(2025) 2025-05-07T20:32:11.5632939Z 2025-05-07T20:32:11.5633212Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5633570Z 2025-05-07T20:32:11.5633774Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.5635794Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5637711Z 2025-05-07T20:32:11.5637839Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.5638058Z 2025-05-07T20:32:11.5638166Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.5638596Z self=, 2025-05-07T20:32:11.5639014Z T=128, 2025-05-07T20:32:11.5639203Z D=7168, 2025-05-07T20:32:11.5639401Z scale_ub=None, 2025-05-07T20:32:11.5639629Z contiguous=True, 2025-05-07T20:32:11.5639861Z compiled=True, 2025-05-07T20:32:11.5640067Z ) 2025-05-07T20:32:11.5640513Z self = 2025-05-07T20:32:11.5641025Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.5641297Z 2025-05-07T20:32:11.5641376Z @given( 2025-05-07T20:32:11.5641617Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5641944Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.5642257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.5642597Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.5642936Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.5643225Z ) 2025-05-07T20:32:11.5643585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.5644043Z def test_silu_mul_quant( 2025-05-07T20:32:11.5644292Z self, 2025-05-07T20:32:11.5644493Z T: int, 2025-05-07T20:32:11.5644694Z D: int, 2025-05-07T20:32:11.5645009Z scale_ub: Optional[float], 2025-05-07T20:32:11.5645305Z contiguous: bool, 2025-05-07T20:32:11.5645565Z compiled: bool, 2025-05-07T20:32:11.5645805Z ) -> None: 2025-05-07T20:32:11.5646030Z torch.manual_seed(2025) 2025-05-07T20:32:11.5646297Z 2025-05-07T20:32:11.5646593Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5649207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5651683Z 2025-05-07T20:32:11.5651818Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.5652061Z 2025-05-07T20:32:11.5701633Z FAILED 2025-05-07T20:32:11.5701836Z 2025-05-07T20:32:11.5702208Z =================================== FAILURES =================================== 2025-05-07T20:32:11.5702684Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:11.5703333Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:11.5704222Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:11.5705024Z | yield 2025-05-07T20:32:11.5705648Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:11.5706395Z | self._callTestMethod(testMethod) 2025-05-07T20:32:11.5706796Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:11.5707607Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:11.5708398Z | if method() is not None: 2025-05-07T20:32:11.5708752Z | ~~~~~~^^ 2025-05-07T20:32:11.5709682Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:11.5710730Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.5711152Z | ^^^^^^^ 2025-05-07T20:32:11.5711958Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:11.5712867Z | raise the_error_hypothesis_found 2025-05-07T20:32:11.5713722Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:11.5714327Z +-+---------------- 1 ---------------- 2025-05-07T20:32:11.5714760Z | Traceback (most recent call last): 2025-05-07T20:32:11.5715790Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5716918Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5719862Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5722732Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5723386Z | self=, 2025-05-07T20:32:11.5723814Z | T=128, 2025-05-07T20:32:11.5724026Z | D=7168, 2025-05-07T20:32:11.5724250Z | scale_ub=1200.0, 2025-05-07T20:32:11.5724503Z | contiguous=True, 2025-05-07T20:32:11.5724760Z | compiled=False, 2025-05-07T20:32:11.5725001Z | ) 2025-05-07T20:32:11.5725186Z | 2025-05-07T20:32:11.5725739Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case 2025-05-07T20:32:11.5726368Z +---------------- 2 ---------------- 2025-05-07T20:32:11.5726666Z | Traceback (most recent call last): 2025-05-07T20:32:11.5727411Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5728216Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5730502Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5732526Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5732980Z | self=, 2025-05-07T20:32:11.5733403Z | T=128, 2025-05-07T20:32:11.5733613Z | D=7168, 2025-05-07T20:32:11.5733825Z | scale_ub=None, 2025-05-07T20:32:11.5734076Z | contiguous=True, 2025-05-07T20:32:11.5734333Z | compiled=True, 2025-05-07T20:32:11.5734564Z | ) 2025-05-07T20:32:11.5734752Z | 2025-05-07T20:32:11.5735297Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:32:11.5735920Z +---------------- 3 ---------------- 2025-05-07T20:32:11.5736260Z | Traceback (most recent call last): 2025-05-07T20:32:11.5737397Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:11.5738552Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.5756667Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
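Each falsifying example above comes with a replay decorator. A sketch of re-running just the first one locally; the blob is version-locked, so it only decodes under hypothesis 6.131.14, and the stub body stands in for the real test method:

from hypothesis import given, reproduce_failure, strategies as st

@reproduce_failure("6.131.14", b"AEEBQQFBAUEAQQE=")
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled):
    ...  # body identical to ActivationTests.test_silu_mul_quant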
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.5759575Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:32:11.5760337Z | self=, 2025-05-07T20:32:11.5760929Z | T=128, 2025-05-07T20:32:11.5761221Z | D=5120, 2025-05-07T20:32:11.5761519Z | scale_ub=1200.0, 2025-05-07T20:32:11.5761859Z | contiguous=True, 2025-05-07T20:32:11.5762209Z | compiled=True, 2025-05-07T20:32:11.5762540Z | ) 2025-05-07T20:32:11.5762790Z | 2025-05-07T20:32:11.5763547Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:32:11.5764437Z +---------------- 4 ---------------- 2025-05-07T20:32:11.5764932Z | Traceback (most recent call last): 2025-05-07T20:32:11.5765986Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:32:11.5767033Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:11.5767451Z | ~~~~~~^^ 2025-05-07T20:32:11.5768381Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:32:11.5769448Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.5770605Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:32:11.5771431Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.5771725Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:32:11.5772140Z | a, 2025-05-07T20:32:11.5772356Z | ^^ 2025-05-07T20:32:11.5772566Z | ...<23 lines>... 
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
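Three of the four distinct failures above are allocator OOMs raised while materializing the [T, 2 * D] bfloat16 input on a 22 GiB card, with nearly all of the 22.05 GiB already held by earlier examples in the same process. The report's own hint is the usual first mitigation. A minimal sketch of applying it, assuming the variable can be put in place before the test process first touches CUDA (the allocator reads it once, at initialization, so exporting it in the workflow step environment is the more reliable route than test code):

    import os

    # Must be set before torch initializes CUDA; a value set after the
    # first allocation is silently ignored.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the env var is in place

    x = torch.randn([16384, 2 * 7168], device="cuda", dtype=torch.bfloat16)

Expandable segments only addresses fragmentation; releasing test tensors between Hypothesis examples (del followed by torch.cuda.empty_cache()) would attack the accumulation itself.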
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f17492f20>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
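The remaining failure mode is not memory but architecture: Triton cannot lower the fp8e4nv (e4m3) element type on this runner's GPU. The linux.g5.4xlarge runner carries an A10G at compute capability 8.6, while Triton's e4m3 support generally begins at SM 8.9 (Ada) and SM 9.0 (Hopper), which is why only 'fp8e4b15' and 'fp8e5' are offered here. A hedged sketch of a capability guard such a test could use; supports_fp8_e4m3 is an illustrative helper, not an existing FBGEMM utility:

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # fp8e4nv corresponds to e4m3; Triton accepts it on SM 8.9+.
        # The A10G on this runner reports (8, 6), hence the ValueError.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 e4m3 not supported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the suite would report a skip on sm_86 runners instead of burning Hypothesis examples on a guaranteed compilation failure.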
(Hypothesis went on to try the examples below. For each one the runner re-printed the identical test source and an identical Triton traceback ending in the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"). Runs with compiled=False fail inside fn(), compiling _fbgemm_silu_mul_quant via silu_mul_quant at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80; runs with compiled=True get past fn() and fail inside ref_fn(), compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row at triton_gemm/fp8_gemm.py:2370.)

Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=True)   -> _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=None,   contiguous=False, compiled=False)  -> _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=False)  -> fn() fails; log cuts off mid-traceback in silu_mul_quant at moe/activation.py:80
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6077554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6077789Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6078142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6078349Z kernel = self.compile( 2025-05-07T20:32:11.6078749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6078937Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6079069Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6079073Z 2025-05-07T20:32:11.6079287Z self = 2025-05-07T20:32:11.6080101Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6080742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143f8220>} 2025-05-07T20:32:11.6081528Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6081728Z context = 2025-05-07T20:32:11.6081733Z 2025-05-07T20:32:11.6081911Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6082186Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6082298Z module_map=module_map) 2025-05-07T20:32:11.6082471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6082574Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6082654Z E ^ 2025-05-07T20:32:11.6083026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6083035Z 2025-05-07T20:32:11.6083466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6083471Z 2025-05-07T20:32:11.6083588Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6083819Z self=, 2025-05-07T20:32:11.6083899Z T=1, 2025-05-07T20:32:11.6083985Z D=5120, 2025-05-07T20:32:11.6084073Z scale_ub=None, 2025-05-07T20:32:11.6084163Z contiguous=True, 2025-05-07T20:32:11.6084256Z compiled=True, 2025-05-07T20:32:11.6084334Z ) 2025-05-07T20:32:11.6084562Z self = 2025-05-07T20:32:11.6084735Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6084739Z 2025-05-07T20:32:11.6084818Z @given( 2025-05-07T20:32:11.6084947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6085056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6085257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6085387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6085505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6085583Z ) 2025-05-07T20:32:11.6085842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6085938Z def test_silu_mul_quant( 2025-05-07T20:32:11.6086018Z self, 2025-05-07T20:32:11.6086105Z T: int, 2025-05-07T20:32:11.6086186Z D: int, 2025-05-07T20:32:11.6086295Z scale_ub: Optional[float], 2025-05-07T20:32:11.6086388Z contiguous: bool, 2025-05-07T20:32:11.6086477Z compiled: bool, 2025-05-07T20:32:11.6086564Z ) -> None: 2025-05-07T20:32:11.6086665Z torch.manual_seed(2025) 2025-05-07T20:32:11.6086742Z 2025-05-07T20:32:11.6086923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6087077Z 2025-05-07T20:32:11.6087178Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6087316Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6087408Z x = x_sign * x_clamp 2025-05-07T20:32:11.6087491Z x0 = x[:, :D] 2025-05-07T20:32:11.6087581Z x1 = x[:, D:] 2025-05-07T20:32:11.6087658Z 2025-05-07T20:32:11.6087753Z if contiguous: 2025-05-07T20:32:11.6087846Z x0 = x0.contiguous() 2025-05-07T20:32:11.6087950Z x1 = x1.contiguous() 2025-05-07T20:32:11.6088027Z 2025-05-07T20:32:11.6088122Z if scale_ub is not None: 2025-05-07T20:32:11.6088238Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6088378Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6088466Z ) 2025-05-07T20:32:11.6088547Z else: 2025-05-07T20:32:11.6088646Z scale_ub_tensor = None 2025-05-07T20:32:11.6088729Z 2025-05-07T20:32:11.6088868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6088967Z op = silu_mul_quant 2025-05-07T20:32:11.6089062Z if compiled: 2025-05-07T20:32:11.6089166Z op = torch.compile(op) 2025-05-07T20:32:11.6089276Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6089361Z 2025-05-07T20:32:11.6089459Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6089584Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6089667Z 2025-05-07T20:32:11.6089806Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6089921Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6090026Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6090152Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6090305Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6090382Z 2025-05-07T20:32:11.6090491Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6090495Z 2025-05-07T20:32:11.6090610Z moe/activation_test.py:126: 2025-05-07T20:32:11.6090742Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6090852Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6090995Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6091572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6091684Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6092055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6092290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6092675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6093029Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6093426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6093599Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6093953Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6094041Z fn() 2025-05-07T20:32:11.6094456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6094542Z self.fn.run( 2025-05-07T20:32:11.6094898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6094996Z kernel = self.compile( 2025-05-07T20:32:11.6095394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6095657Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6095791Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6095795Z 2025-05-07T20:32:11.6096019Z self = 2025-05-07T20:32:11.6096826Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6097354Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f143fae80>} 2025-05-07T20:32:11.6098122Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6098331Z context = 2025-05-07T20:32:11.6098342Z 2025-05-07T20:32:11.6098516Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6098791Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6098909Z module_map=module_map) 2025-05-07T20:32:11.6099076Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6099183Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6099271Z E ^ 2025-05-07T20:32:11.6099638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6099643Z 2025-05-07T20:32:11.6100077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6100087Z 2025-05-07T20:32:11.6100200Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6100436Z self=, 2025-05-07T20:32:11.6100522Z T=2048, 2025-05-07T20:32:11.6100599Z D=5120, 2025-05-07T20:32:11.6100702Z scale_ub=None, 2025-05-07T20:32:11.6100792Z contiguous=True, 2025-05-07T20:32:11.6100888Z compiled=True, 2025-05-07T20:32:11.6100966Z ) 2025-05-07T20:32:11.6101194Z self = 2025-05-07T20:32:11.6101378Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6101382Z 2025-05-07T20:32:11.6101462Z @given( 2025-05-07T20:32:11.6101586Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6101697Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6101818Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6101951Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6102157Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6102238Z ) 2025-05-07T20:32:11.6102502Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6102600Z def test_silu_mul_quant( 2025-05-07T20:32:11.6102681Z self, 2025-05-07T20:32:11.6102771Z T: int, 2025-05-07T20:32:11.6102851Z D: int, 2025-05-07T20:32:11.6102954Z scale_ub: Optional[float], 2025-05-07T20:32:11.6103055Z contiguous: bool, 2025-05-07T20:32:11.6103146Z compiled: bool, 2025-05-07T20:32:11.6103229Z ) -> None: 2025-05-07T20:32:11.6103336Z torch.manual_seed(2025) 2025-05-07T20:32:11.6103419Z 2025-05-07T20:32:11.6103594Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6103679Z 2025-05-07T20:32:11.6103865Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6104081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6104287Z x = x_sign * x_clamp 2025-05-07T20:32:11.6104592Z x0 = x[:, :D] 2025-05-07T20:32:11.6104720Z x1 = x[:, D:] 2025-05-07T20:32:11.6104858Z 2025-05-07T20:32:11.6104999Z if contiguous: 2025-05-07T20:32:11.6112837Z x0 = x0.contiguous() 2025-05-07T20:32:11.6112967Z x1 = x1.contiguous() 2025-05-07T20:32:11.6113050Z 2025-05-07T20:32:11.6113159Z if scale_ub is not None: 2025-05-07T20:32:11.6113282Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6113762Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6113901Z ) 2025-05-07T20:32:11.6114004Z else: 2025-05-07T20:32:11.6114107Z scale_ub_tensor = None 2025-05-07T20:32:11.6114195Z 2025-05-07T20:32:11.6114339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6114439Z op = silu_mul_quant 2025-05-07T20:32:11.6114556Z if compiled: 2025-05-07T20:32:11.6114674Z op = torch.compile(op) 2025-05-07T20:32:11.6114797Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6114876Z 2025-05-07T20:32:11.6114974Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6115111Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6115190Z 2025-05-07T20:32:11.6115334Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6115452Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6115558Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6115689Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6115845Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6115924Z 2025-05-07T20:32:11.6116040Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6116046Z 2025-05-07T20:32:11.6116152Z moe/activation_test.py:126: 2025-05-07T20:32:11.6116296Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6116422Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6116565Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6117152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6117269Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6117648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6117897Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6118283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6118556Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6119203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6119385Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6119753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6119838Z fn() 2025-05-07T20:32:11.6120330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6120429Z self.fn.run( 2025-05-07T20:32:11.6120783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6120884Z kernel = self.compile( 2025-05-07T20:32:11.6121292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6121478Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6121757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6121762Z 2025-05-07T20:32:11.6121977Z self = 2025-05-07T20:32:11.6122786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6123319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f0b80>} 2025-05-07T20:32:11.6124088Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6124298Z context = 2025-05-07T20:32:11.6124308Z 2025-05-07T20:32:11.6124486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6124761Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6124882Z module_map=module_map) 2025-05-07T20:32:11.6125052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6125168Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6125250Z E ^ 2025-05-07T20:32:11.6125621Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6125626Z 2025-05-07T20:32:11.6126064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6126069Z 2025-05-07T20:32:11.6126178Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6126418Z self=, 2025-05-07T20:32:11.6126514Z T=128, 2025-05-07T20:32:11.6126595Z D=5120, 2025-05-07T20:32:11.6126690Z scale_ub=None, 2025-05-07T20:32:11.6126780Z contiguous=True, 2025-05-07T20:32:11.6126867Z compiled=True, 2025-05-07T20:32:11.6126953Z ) 2025-05-07T20:32:11.6127182Z self = 2025-05-07T20:32:11.6127359Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6127364Z 2025-05-07T20:32:11.6127453Z @given( 2025-05-07T20:32:11.6127576Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6127687Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6127809Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6127933Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6128059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6128143Z ) 2025-05-07T20:32:11.6128479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6128591Z def test_silu_mul_quant( 2025-05-07T20:32:11.6128677Z self, 2025-05-07T20:32:11.6128778Z T: int, 2025-05-07T20:32:11.6128870Z D: int, 2025-05-07T20:32:11.6128992Z scale_ub: Optional[float], 2025-05-07T20:32:11.6129086Z contiguous: bool, 2025-05-07T20:32:11.6129183Z compiled: bool, 2025-05-07T20:32:11.6129266Z ) -> None: 2025-05-07T20:32:11.6129372Z torch.manual_seed(2025) 2025-05-07T20:32:11.6129453Z 2025-05-07T20:32:11.6129630Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6129713Z 2025-05-07T20:32:11.6129811Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6129941Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6130041Z x = x_sign * x_clamp 2025-05-07T20:32:11.6130126Z x0 = x[:, :D] 2025-05-07T20:32:11.6130287Z x1 = x[:, D:] 2025-05-07T20:32:11.6130376Z 2025-05-07T20:32:11.6130472Z if contiguous: 2025-05-07T20:32:11.6130569Z x0 = x0.contiguous() 2025-05-07T20:32:11.6130671Z x1 = x1.contiguous() 2025-05-07T20:32:11.6130749Z 2025-05-07T20:32:11.6130844Z if scale_ub is not None: 2025-05-07T20:32:11.6130967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6131109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6131198Z ) 2025-05-07T20:32:11.6131281Z else: 2025-05-07T20:32:11.6131383Z scale_ub_tensor = None 2025-05-07T20:32:11.6131471Z 2025-05-07T20:32:11.6131607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6131702Z op = silu_mul_quant 2025-05-07T20:32:11.6131800Z if compiled: 2025-05-07T20:32:11.6131906Z op = torch.compile(op) 2025-05-07T20:32:11.6132019Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6132112Z 2025-05-07T20:32:11.6132213Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6132340Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6132424Z 2025-05-07T20:32:11.6132565Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6132680Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6132785Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6132912Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6133066Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6133147Z 2025-05-07T20:32:11.6133253Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6133257Z 2025-05-07T20:32:11.6133373Z moe/activation_test.py:126: 2025-05-07T20:32:11.6133507Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6133627Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6133774Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6134358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6134473Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6134855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6135089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6135481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6135751Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6136153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6136332Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6136773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6136863Z fn() 2025-05-07T20:32:11.6137279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6137367Z self.fn.run( 2025-05-07T20:32:11.6137729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6137828Z kernel = self.compile( 2025-05-07T20:32:11.6138230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6138413Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6138546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6138551Z 2025-05-07T20:32:11.6138848Z self = 2025-05-07T20:32:11.6139659Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6140190Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f141f1da0>} 2025-05-07T20:32:11.6140964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6141173Z context = 2025-05-07T20:32:11.6141178Z 2025-05-07T20:32:11.6141351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6141636Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6141757Z module_map=module_map) 2025-05-07T20:32:11.6141926Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6142033Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6142122Z E ^ 2025-05-07T20:32:11.6142490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6142495Z 2025-05-07T20:32:11.6142935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6142939Z 2025-05-07T20:32:11.6143050Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6143283Z self=, 2025-05-07T20:32:11.6143372Z T=4096, 2025-05-07T20:32:11.6143455Z D=5120, 2025-05-07T20:32:11.6143548Z scale_ub=None, 2025-05-07T20:32:11.6143646Z contiguous=True, 2025-05-07T20:32:11.6143740Z compiled=True, 2025-05-07T20:32:11.6143817Z ) 2025-05-07T20:32:11.6144051Z self = 2025-05-07T20:32:11.6144228Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6144232Z 2025-05-07T20:32:11.6144318Z @given( 2025-05-07T20:32:11.6144445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6144550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6144677Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6144799Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6144917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6145002Z ) 2025-05-07T20:32:11.6145258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6145370Z def test_silu_mul_quant( 2025-05-07T20:32:11.6145451Z self, 2025-05-07T20:32:11.6145640Z T: int, 2025-05-07T20:32:11.6145731Z D: int, 2025-05-07T20:32:11.6145836Z scale_ub: Optional[float], 2025-05-07T20:32:11.6145932Z contiguous: bool, 2025-05-07T20:32:11.6146032Z compiled: bool, 2025-05-07T20:32:11.6146115Z ) -> None: 2025-05-07T20:32:11.6146214Z torch.manual_seed(2025) 2025-05-07T20:32:11.6146297Z 2025-05-07T20:32:11.6146474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6146553Z 2025-05-07T20:32:11.6146657Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6146788Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6146890Z x = x_sign * x_clamp 2025-05-07T20:32:11.6146974Z x0 = x[:, :D] 2025-05-07T20:32:11.6147060Z x1 = x[:, D:] 2025-05-07T20:32:11.6147147Z 2025-05-07T20:32:11.6147235Z if contiguous: 2025-05-07T20:32:11.6147411Z x0 = x0.contiguous() 2025-05-07T20:32:11.6147515Z x1 = x1.contiguous() 2025-05-07T20:32:11.6147598Z 2025-05-07T20:32:11.6147693Z if scale_ub is not None: 2025-05-07T20:32:11.6147813Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6147954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6148035Z ) 2025-05-07T20:32:11.6148123Z else: 2025-05-07T20:32:11.6148222Z scale_ub_tensor = None 2025-05-07T20:32:11.6148299Z 2025-05-07T20:32:11.6148447Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6148560Z op = silu_mul_quant 2025-05-07T20:32:11.6148667Z if compiled: 2025-05-07T20:32:11.6148795Z op = torch.compile(op) 2025-05-07T20:32:11.6148906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6148990Z 2025-05-07T20:32:11.6149086Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6149215Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6149310Z 2025-05-07T20:32:11.6149460Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6149567Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6149679Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6149808Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6149965Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6150045Z 2025-05-07T20:32:11.6150150Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6150154Z 2025-05-07T20:32:11.6150266Z moe/activation_test.py:126: 2025-05-07T20:32:11.6150401Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6150511Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6150658Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6151245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6151365Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6151739Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6151974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6152360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6152632Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6153023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6153204Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6153557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6153653Z fn() 2025-05-07T20:32:11.6154157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6154246Z self.fn.run( 2025-05-07T20:32:11.6154605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6154704Z kernel = self.compile( 2025-05-07T20:32:11.6155100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6155292Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6155424Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6155429Z 2025-05-07T20:32:11.6155649Z self = 2025-05-07T20:32:11.6156454Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6157063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef32bf1a0>} 2025-05-07T20:32:11.6157836Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6158037Z context = 2025-05-07T20:32:11.6158041Z 2025-05-07T20:32:11.6158222Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6158500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6158625Z module_map=module_map) 2025-05-07T20:32:11.6158798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6158905Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6158994Z E ^ 2025-05-07T20:32:11.6159362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6159366Z 2025-05-07T20:32:11.6159796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6159807Z 2025-05-07T20:32:11.6159917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6160231Z self=, 2025-05-07T20:32:11.6160323Z T=16384, 2025-05-07T20:32:11.6160403Z D=5120, 2025-05-07T20:32:11.6160489Z scale_ub=None, 2025-05-07T20:32:11.6160586Z contiguous=True, 2025-05-07T20:32:11.6160673Z compiled=True, 2025-05-07T20:32:11.6160757Z ) 2025-05-07T20:32:11.6160996Z self = 2025-05-07T20:32:11.6161180Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.6161184Z 2025-05-07T20:32:11.6161265Z @given( 2025-05-07T20:32:11.6161400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6161504Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6161635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6161758Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6161877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6161964Z ) 2025-05-07T20:32:11.6162220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6162318Z def test_silu_mul_quant( 2025-05-07T20:32:11.6162404Z self, 2025-05-07T20:32:11.6162486Z T: int, 2025-05-07T20:32:11.6162566Z D: int, 2025-05-07T20:32:11.6162680Z scale_ub: Optional[float], 2025-05-07T20:32:11.6162857Z contiguous: bool, 2025-05-07T20:32:11.6162958Z compiled: bool, 2025-05-07T20:32:11.6163043Z ) -> None: 2025-05-07T20:32:11.6163142Z torch.manual_seed(2025) 2025-05-07T20:32:11.6163226Z 2025-05-07T20:32:11.6163401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6163479Z 2025-05-07T20:32:11.6163582Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6163717Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6163814Z x = x_sign * x_clamp 2025-05-07T20:32:11.6163908Z x0 = x[:, :D] 2025-05-07T20:32:11.6163994Z x1 = x[:, D:] 2025-05-07T20:32:11.6164071Z 2025-05-07T20:32:11.6164171Z if contiguous: 2025-05-07T20:32:11.6164271Z x0 = x0.contiguous() 2025-05-07T20:32:11.6164366Z x1 = x1.contiguous() 2025-05-07T20:32:11.6164452Z 2025-05-07T20:32:11.6164626Z if scale_ub is not None: 2025-05-07T20:32:11.6164751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6164893Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6164975Z ) 2025-05-07T20:32:11.6165064Z else: 2025-05-07T20:32:11.6165163Z scale_ub_tensor = None 2025-05-07T20:32:11.6165240Z 2025-05-07T20:32:11.6165385Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6165481Z op = silu_mul_quant 2025-05-07T20:32:11.6165570Z if compiled: 2025-05-07T20:32:11.6165681Z op = torch.compile(op) 2025-05-07T20:32:11.6165793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6165871Z 2025-05-07T20:32:11.6165972Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6166099Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6166181Z 2025-05-07T20:32:11.6166322Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6166433Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6166547Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6166674Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6166826Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6166903Z 2025-05-07T20:32:11.6167007Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:11.6167011Z 2025-05-07T20:32:11.6167119Z moe/activation_test.py:126: 2025-05-07T20:32:11.6167252Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6167362Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6167507Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6168084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6168198Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6168602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6168865Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6169250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6169515Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6169904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6170083Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6170436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6170521Z fn() 2025-05-07T20:32:11.6170936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6171110Z self.fn.run( 2025-05-07T20:32:11.6171466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6171564Z kernel = self.compile( 2025-05-07T20:32:11.6171957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6172147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6172282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6172287Z 2025-05-07T20:32:11.6172506Z self = 2025-05-07T20:32:11.6173308Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6173942Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef3a90860>} 2025-05-07T20:32:11.6174711Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6174911Z context = 2025-05-07T20:32:11.6174916Z 2025-05-07T20:32:11.6175097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6175373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6175493Z module_map=module_map) 2025-05-07T20:32:11.6175662Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6175775Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6175864Z E ^ 2025-05-07T20:32:11.6176236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6176241Z 2025-05-07T20:32:11.6176671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6176683Z 2025-05-07T20:32:11.6176795Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6177026Z self=, 2025-05-07T20:32:11.6177113Z T=1, 2025-05-07T20:32:11.6177193Z D=5120, 2025-05-07T20:32:11.6177284Z scale_ub=1200.0, 2025-05-07T20:32:11.6177380Z contiguous=True, 2025-05-07T20:32:11.6177467Z compiled=True, 2025-05-07T20:32:11.6177544Z ) 2025-05-07T20:32:11.6177778Z self = 2025-05-07T20:32:11.6177950Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6177959Z 2025-05-07T20:32:11.6178042Z @given( 2025-05-07T20:32:11.6178175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6178278Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6178405Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6178531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6178651Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6178736Z ) 2025-05-07T20:32:11.6178992Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6179091Z def test_silu_mul_quant( 2025-05-07T20:32:11.6179180Z self, 2025-05-07T20:32:11.6179263Z T: int, 2025-05-07T20:32:11.6179343Z D: int, 2025-05-07T20:32:11.6179450Z scale_ub: Optional[float], 2025-05-07T20:32:11.6179543Z contiguous: bool, 2025-05-07T20:32:11.6179639Z compiled: bool, 2025-05-07T20:32:11.6179728Z ) -> None: 2025-05-07T20:32:11.6179908Z torch.manual_seed(2025) 2025-05-07T20:32:11.6179994Z 2025-05-07T20:32:11.6180169Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6180244Z 2025-05-07T20:32:11.6180346Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6180476Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6180568Z x = x_sign * x_clamp 2025-05-07T20:32:11.6180658Z x0 = x[:, :D] 2025-05-07T20:32:11.6180741Z x1 = x[:, D:] 2025-05-07T20:32:11.6180817Z 2025-05-07T20:32:11.6180910Z if contiguous: 2025-05-07T20:32:11.6181005Z x0 = x0.contiguous() 2025-05-07T20:32:11.6181099Z x1 = x1.contiguous() 2025-05-07T20:32:11.6181180Z 2025-05-07T20:32:11.6181273Z if scale_ub is not None: 2025-05-07T20:32:11.6181389Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6181528Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6181686Z ) 2025-05-07T20:32:11.6181777Z else: 2025-05-07T20:32:11.6181874Z scale_ub_tensor = None 2025-05-07T20:32:11.6181950Z 2025-05-07T20:32:11.6182090Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6182184Z op = silu_mul_quant 2025-05-07T20:32:11.6182276Z if compiled: 2025-05-07T20:32:11.6182387Z op = torch.compile(op) 2025-05-07T20:32:11.6182499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6182575Z 2025-05-07T20:32:11.6182676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6182681Z 2025-05-07T20:32:11.6182781Z moe/activation_test.py:117: 2025-05-07T20:32:11.6182918Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6183023Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6183127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6183520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6183623Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6184136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6184245Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6184616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6184855Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6185209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6185308Z kernel = self.compile( 2025-05-07T20:32:11.6185709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6185897Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6186040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6186044Z 2025-05-07T20:32:11.6186257Z self = 2025-05-07T20:32:11.6187058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6187587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2d12ca0>} 2025-05-07T20:32:11.6188357Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6188647Z context = 2025-05-07T20:32:11.6188652Z 2025-05-07T20:32:11.6188827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6189103Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6189221Z module_map=module_map) 2025-05-07T20:32:11.6189389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6189501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6189582Z E ^ 2025-05-07T20:32:11.6189949Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6189953Z 2025-05-07T20:32:11.6190390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6190395Z 2025-05-07T20:32:11.6190503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6190823Z self=, 2025-05-07T20:32:11.6190907Z T=1, 2025-05-07T20:32:11.6190986Z D=5120, 2025-05-07T20:32:11.6191079Z scale_ub=None, 2025-05-07T20:32:11.6191171Z contiguous=False, 2025-05-07T20:32:11.6191261Z compiled=True, 2025-05-07T20:32:11.6191345Z ) 2025-05-07T20:32:11.6191573Z self = 2025-05-07T20:32:11.6191744Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6191748Z 2025-05-07T20:32:11.6191839Z @given( 2025-05-07T20:32:11.6191963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6192067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6192196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6192319Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6192443Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6192528Z ) 2025-05-07T20:32:11.6192788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6192892Z def test_silu_mul_quant( 2025-05-07T20:32:11.6192973Z self, 2025-05-07T20:32:11.6193057Z T: int, 2025-05-07T20:32:11.6193144Z D: int, 2025-05-07T20:32:11.6193246Z scale_ub: Optional[float], 2025-05-07T20:32:11.6193340Z contiguous: bool, 2025-05-07T20:32:11.6193437Z compiled: bool, 2025-05-07T20:32:11.6193519Z ) -> None: 2025-05-07T20:32:11.6193618Z torch.manual_seed(2025) 2025-05-07T20:32:11.6193701Z 2025-05-07T20:32:11.6193876Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6193959Z 2025-05-07T20:32:11.6194055Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6194185Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6194286Z x = x_sign * x_clamp 2025-05-07T20:32:11.6194377Z x0 = x[:, :D] 2025-05-07T20:32:11.6194465Z x1 = x[:, D:] 2025-05-07T20:32:11.6194548Z 2025-05-07T20:32:11.6194635Z if contiguous: 2025-05-07T20:32:11.6194731Z x0 = x0.contiguous() 2025-05-07T20:32:11.6194830Z x1 = x1.contiguous() 2025-05-07T20:32:11.6194907Z 2025-05-07T20:32:11.6195001Z if scale_ub is not None: 2025-05-07T20:32:11.6195119Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6195259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6195342Z ) 2025-05-07T20:32:11.6195421Z else: 2025-05-07T20:32:11.6195519Z scale_ub_tensor = None 2025-05-07T20:32:11.6195600Z 2025-05-07T20:32:11.6195733Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6195827Z op = silu_mul_quant 2025-05-07T20:32:11.6195921Z if compiled: 2025-05-07T20:32:11.6196025Z op = torch.compile(op) 2025-05-07T20:32:11.6196140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6196307Z 2025-05-07T20:32:11.6196404Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6196528Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6196611Z 2025-05-07T20:32:11.6196752Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6196864Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6196969Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6197093Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6197244Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6197320Z 2025-05-07T20:32:11.6197425Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6197429Z 2025-05-07T20:32:11.6197537Z moe/activation_test.py:126: 2025-05-07T20:32:11.6197669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6197855Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6198007Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6198588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6198702Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6199074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6199309Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6199694Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6199960Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6200438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6200623Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6200979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6201065Z fn() 2025-05-07T20:32:11.6201478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6201563Z self.fn.run( 2025-05-07T20:32:11.6201916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6202014Z kernel = self.compile( 2025-05-07T20:32:11.6202411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6202592Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6202724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6202734Z 2025-05-07T20:32:11.6202957Z self = 2025-05-07T20:32:11.6203758Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6204288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef38c60c0>} 2025-05-07T20:32:11.6205054Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6205253Z context = 2025-05-07T20:32:11.6205264Z 2025-05-07T20:32:11.6205441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6205800Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6205920Z module_map=module_map) 2025-05-07T20:32:11.6206088Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6206195Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6206284Z E ^ 2025-05-07T20:32:11.6206651Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6206656Z 2025-05-07T20:32:11.6207091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6207095Z 2025-05-07T20:32:11.6207203Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6207434Z self=, 2025-05-07T20:32:11.6207623Z T=1, 2025-05-07T20:32:11.6207704Z D=5120, 2025-05-07T20:32:11.6207798Z scale_ub=None, 2025-05-07T20:32:11.6207895Z contiguous=True, 2025-05-07T20:32:11.6207983Z compiled=False, 2025-05-07T20:32:11.6208060Z ) 2025-05-07T20:32:11.6208293Z self = 2025-05-07T20:32:11.6208462Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6208466Z 2025-05-07T20:32:11.6208553Z @given( 2025-05-07T20:32:11.6208678Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6208781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6208907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6209032Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6209151Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6209237Z ) 2025-05-07T20:32:11.6209494Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6209608Z def test_silu_mul_quant( 2025-05-07T20:32:11.6209694Z self, 2025-05-07T20:32:11.6209775Z T: int, 2025-05-07T20:32:11.6209865Z D: int, 2025-05-07T20:32:11.6209967Z scale_ub: Optional[float], 2025-05-07T20:32:11.6210061Z contiguous: bool, 2025-05-07T20:32:11.6210163Z compiled: bool, 2025-05-07T20:32:11.6210245Z ) -> None: 2025-05-07T20:32:11.6210347Z torch.manual_seed(2025) 2025-05-07T20:32:11.6210430Z 2025-05-07T20:32:11.6210603Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6210682Z 2025-05-07T20:32:11.6210786Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6210918Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6211011Z x = x_sign * x_clamp 2025-05-07T20:32:11.6211105Z x0 = x[:, :D] 2025-05-07T20:32:11.6211190Z x1 = x[:, D:] 2025-05-07T20:32:11.6211273Z 2025-05-07T20:32:11.6211367Z if contiguous: 2025-05-07T20:32:11.6211463Z x0 = x0.contiguous() 2025-05-07T20:32:11.6211570Z x1 = x1.contiguous() 2025-05-07T20:32:11.6211647Z 2025-05-07T20:32:11.6211740Z if scale_ub is not None: 2025-05-07T20:32:11.6211856Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6211996Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6212075Z ) 2025-05-07T20:32:11.6212161Z else: 2025-05-07T20:32:11.6212258Z scale_ub_tensor = None 2025-05-07T20:32:11.6212333Z 2025-05-07T20:32:11.6212471Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6212564Z op = silu_mul_quant 2025-05-07T20:32:11.6212657Z if compiled: 2025-05-07T20:32:11.6212760Z op = torch.compile(op) 2025-05-07T20:32:11.6212869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6212950Z 2025-05-07T20:32:11.6213044Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6213054Z 2025-05-07T20:32:11.6213239Z moe/activation_test.py:117: 2025-05-07T20:32:11.6213639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6213795Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6213918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6214448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6214552Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6214933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6215168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6215523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6215628Z kernel = self.compile( 2025-05-07T20:32:11.6216230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6216422Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6216560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6216565Z 2025-05-07T20:32:11.6216779Z self = 2025-05-07T20:32:11.6217591Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6218115Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1f164f1bc0>} 2025-05-07T20:32:11.6218897Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6219104Z context = 2025-05-07T20:32:11.6219109Z 2025-05-07T20:32:11.6219283Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6219568Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6219682Z module_map=module_map) 2025-05-07T20:32:11.6219856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6219961Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6220045Z E ^ 2025-05-07T20:32:11.6220419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6220423Z 2025-05-07T20:32:11.6220857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
[... test body identical to the example above; with compiled=True the call additionally passes through torch/_dynamo/eval_frame.py:678 before reaching silu_mul_quant, then fails in _fbgemm_silu_mul_quant with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
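[Note: a minimal sketch, not part of this log, of how a test suite could skip these cases up front. It assumes the root cause indicated by the repeated error: Triton's fp8e4nv (FP8 E4M3) is only compiled for SM 8.9+ GPUs (Ada/Hopper), while this runner's g5.4xlarge provides an NVIDIA A10G (SM 8.6), which only exposes fp8e4b15 and fp8e5. The helper name supports_fp8e4nv is hypothetical.]

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton accepts fp8e4nv only for compute capability >= (8, 9); anything
    # older (e.g. an A10G at (8, 6)) raises the ValueError seen in this log.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires SM 8.9+")
class SiluMulQuantTests(unittest.TestCase):
    ...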
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError in _fbgemm_silu_mul_quant ...]
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError ...]
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[... identical test body and traceback; same CompilationError ...]
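[Note: a standalone sketch, not from this log, of the Hypothesis pattern driving the "Trying example" lines above; st.sampled_from enumerates fixed choices and @settings caps the number of generated cases (_MAX_SAMPLES in the real test; 16 here is an arbitrary stand-in).]

from typing import Optional

from hypothesis import Verbosity, given, settings
from hypothesis import strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
@settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
def check_parameter_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
    # Each generated combination corresponds to one "Trying example" line above.
    assert T >= 1 and D in (5120, 7168)

# Calling check_parameter_grid() runs up to max_examples sampled combinations.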
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678; same CompilationError ...]
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678, ending with ...]
2025-05-07T20:32:11.6309450Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6309555Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6309643Z E ^ 2025-05-07T20:32:11.6310010Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6310015Z 2025-05-07T20:32:11.6310451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[... test body identical to the example above, except that here fn() itself succeeds and the failure moves into the reference path: ...]
2025-05-07T20:32:11.6317004Z y_fp8, y_scale = fn() 2025-05-07T20:32:11.6317131Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:11.6317334Z 2025-05-07T20:32:11.6317484Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6317589Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:11.6317700Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:11.6317827Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:11.6317972Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6318054Z 2025-05-07T20:32:11.6318157Z > y_fp8_ref,
y_scale_ref = ref_fn() 2025-05-07T20:32:11.6318162Z 2025-05-07T20:32:11.6318269Z moe/activation_test.py:126: 2025-05-07T20:32:11.6318405Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6318513Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:11.6318681Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:11.6319283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:11.6319399Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:11.6319778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6320010Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6320481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:11.6320749Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:11.6321140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:11.6321318Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:11.6321670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:11.6321760Z fn() 2025-05-07T20:32:11.6322180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:11.6322265Z self.fn.run( 2025-05-07T20:32:11.6322620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6322718Z kernel = self.compile( 2025-05-07T20:32:11.6323110Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6323301Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6323433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6323438Z 2025-05-07T20:32:11.6323654Z self = 2025-05-07T20:32:11.6324539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6325069Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef2ab3e20>} 2025-05-07T20:32:11.6325843Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6326043Z context = 2025-05-07T20:32:11.6326047Z 2025-05-07T20:32:11.6326224Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6326500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6326611Z module_map=module_map) 2025-05-07T20:32:11.6326868Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6326974Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:11.6327062Z E ^ 2025-05-07T20:32:11.6327430Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6327434Z 2025-05-07T20:32:11.6327862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body; > y_fp8, y_scale = fn() fails again inside _fbgemm_silu_mul_quant with the same CompilationError ...]
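[Note: a minimal sketch, not from this log, of what the test's ref_fn computes, written in plain PyTorch so it runs on any architecture. FP8_MAX and the exact scale_ub semantics are assumptions about triton_quantize_fp8_row, not FBGEMM's implementation.]

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, exactly as ref_fn above builds y.
    x0_fp32 = x0.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)
    # Row-wise quantization: one dequantization scale per row, so that
    # y ~= y_fp8.to(torch.float32) * y_scale[:, None], as the test checks.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # assumed cap semantics
    row_max = torch.clamp(row_max, min=1e-12)  # avoid division by zero
    y_scale = row_max / FP8_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale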
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
[... identical test body and traceback; same CompilationError ...]
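[Note: a small sketch, not from this log, of the eager-vs-compiled toggle the test uses; it shows why the compiled=True failures above carry torch/_dynamo/eval_frame.py:678 frames before the Triton launch while the compiled=False ones do not.]

import torch

def run_op(op, *args, compiled: bool = False):
    if compiled:
        # torch.compile wraps op, so the call enters through torch._dynamo's
        # eval_frame hook before reaching the underlying Triton kernel launch.
        op = torch.compile(op)
    return op(*args)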
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
[... identical test body and traceback via torch/_dynamo/eval_frame.py:678, ending with ...]
2025-05-07T20:32:11.6367766Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6367869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6367949Z E ^ 2025-05-07T20:32:11.6368324Z E ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:11.6368757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then tried the following examples; each ran the identical test body shown above and failed with the same CompilationError raised from triton/compiler/compiler.py:100:

Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=None,   contiguous=True,  compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
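Root cause, for reference: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs expose only from compute capability 8.9 (Ada, Hopper) onward. The linux.g5.4xlarge runner carries an A10G (SM86), where Triton's CUDA backend offers only fp8e4b15 and fp8e5, exactly the two types listed in the ValueError, so every example aborts in make_ir before the kernel can run. The snippet below is a minimal sketch of the same failure, not code from this repository; it assumes nothing beyond the public torch and triton APIs:

import torch
import triton
import triton.language as tl

@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # The cast to tl.float8e4nv is what trips the architecture check;
    # on an SM86 device, compiling this kernel raises the same
    # CompilationError seen throughout this log.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = torch.empty(1024, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y, 1024, BLOCK=1024)  # fails at compile time on pre-SM89 GPUs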
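A guard in the same spirit would let the suite skip cleanly on this hardware instead of failing example after example: gate the FP8 tests on the device's compute capability. This is a hedged sketch against the standard torch API only; the helper name and decorator placement are illustrative, not the actual structure of moe/activation_test.py:

import unittest
import torch

def _supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (FP8 E4M3) needs an NVIDIA GPU with
    # compute capability >= (8, 9); the A10G here reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement on the failing test:
# @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM89+ (Ada/Hopper)")
# def test_silu_mul_quant(self, ...) -> None: ...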
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6522032Z 2025-05-07T20:32:11.6522466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6522470Z 2025-05-07T20:32:11.6522593Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6522829Z self=, 2025-05-07T20:32:11.6522920Z T=1, 2025-05-07T20:32:11.6523002Z D=7168, 2025-05-07T20:32:11.6523090Z scale_ub=None, 2025-05-07T20:32:11.6523191Z contiguous=False, 2025-05-07T20:32:11.6523282Z compiled=False, 2025-05-07T20:32:11.6523362Z ) 2025-05-07T20:32:11.6523598Z self = 2025-05-07T20:32:11.6523775Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.6523786Z 2025-05-07T20:32:11.6523869Z @given( 2025-05-07T20:32:11.6524007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6524113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6524242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6524368Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6524489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6524579Z ) 2025-05-07T20:32:11.6524835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6524936Z def test_silu_mul_quant( 2025-05-07T20:32:11.6525028Z self, 2025-05-07T20:32:11.6525113Z T: int, 2025-05-07T20:32:11.6525194Z D: int, 2025-05-07T20:32:11.6525308Z scale_ub: Optional[float], 2025-05-07T20:32:11.6525403Z contiguous: bool, 2025-05-07T20:32:11.6525496Z compiled: bool, 2025-05-07T20:32:11.6525591Z ) -> None: 2025-05-07T20:32:11.6525699Z torch.manual_seed(2025) 2025-05-07T20:32:11.6525786Z 2025-05-07T20:32:11.6525972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6526049Z 2025-05-07T20:32:11.6526153Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6526284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6526378Z x = x_sign * x_clamp 2025-05-07T20:32:11.6526474Z x0 = x[:, :D] 2025-05-07T20:32:11.6526559Z x1 = x[:, D:] 2025-05-07T20:32:11.6526635Z 2025-05-07T20:32:11.6526733Z if contiguous: 2025-05-07T20:32:11.6526829Z x0 = x0.contiguous() 2025-05-07T20:32:11.6526927Z x1 = x1.contiguous() 2025-05-07T20:32:11.6527013Z 2025-05-07T20:32:11.6527107Z if scale_ub is not None: 2025-05-07T20:32:11.6527220Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6527373Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6527458Z ) 2025-05-07T20:32:11.6527553Z else: 2025-05-07T20:32:11.6527733Z scale_ub_tensor = None 2025-05-07T20:32:11.6527811Z 2025-05-07T20:32:11.6527956Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6528054Z op = silu_mul_quant 2025-05-07T20:32:11.6528143Z if compiled: 2025-05-07T20:32:11.6528264Z op = torch.compile(op) 2025-05-07T20:32:11.6528375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6528453Z 2025-05-07T20:32:11.6528559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6528564Z 2025-05-07T20:32:11.6528665Z moe/activation_test.py:117: 2025-05-07T20:32:11.6528806Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6528916Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6529025Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6529554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6529739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6530112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6530356Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6530711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6530818Z kernel = self.compile( 2025-05-07T20:32:11.6531218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6531409Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6531924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6531928Z 2025-05-07T20:32:11.6532143Z self = 2025-05-07T20:32:11.6532970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6533495Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef19909a0>} 2025-05-07T20:32:11.6534266Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6534474Z context = 2025-05-07T20:32:11.6534479Z 2025-05-07T20:32:11.6534655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6534936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6535058Z module_map=module_map) 2025-05-07T20:32:11.6535227Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6535341Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6535424Z E ^ 2025-05-07T20:32:11.6535801Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6535806Z 2025-05-07T20:32:11.6536237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6536242Z 2025-05-07T20:32:11.6536351Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6536591Z self=, 2025-05-07T20:32:11.6536673Z T=2048, 2025-05-07T20:32:11.6536754Z D=7168, 2025-05-07T20:32:11.6536850Z scale_ub=None, 2025-05-07T20:32:11.6536949Z contiguous=False, 2025-05-07T20:32:11.6537047Z compiled=True, 2025-05-07T20:32:11.6537209Z ) 2025-05-07T20:32:11.6537440Z self = 2025-05-07T20:32:11.6537632Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6537636Z 2025-05-07T20:32:11.6537718Z @given( 2025-05-07T20:32:11.6537844Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6537960Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6538083Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6538209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6538338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6538417Z ) 2025-05-07T20:32:11.6538684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6538784Z def test_silu_mul_quant( 2025-05-07T20:32:11.6538865Z self, 2025-05-07T20:32:11.6539030Z T: int, 2025-05-07T20:32:11.6539112Z D: int, 2025-05-07T20:32:11.6539219Z scale_ub: Optional[float], 2025-05-07T20:32:11.6539323Z contiguous: bool, 2025-05-07T20:32:11.6539414Z compiled: bool, 2025-05-07T20:32:11.6539497Z ) -> None: 2025-05-07T20:32:11.6539607Z torch.manual_seed(2025) 2025-05-07T20:32:11.6539685Z 2025-05-07T20:32:11.6539865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6539951Z 2025-05-07T20:32:11.6540049Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6540189Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6540284Z x = x_sign * x_clamp 2025-05-07T20:32:11.6540369Z x0 = x[:, :D] 2025-05-07T20:32:11.6540463Z x1 = x[:, D:] 2025-05-07T20:32:11.6540541Z 2025-05-07T20:32:11.6540632Z if contiguous: 2025-05-07T20:32:11.6540737Z x0 = x0.contiguous() 2025-05-07T20:32:11.6540837Z x1 = x1.contiguous() 2025-05-07T20:32:11.6540915Z 2025-05-07T20:32:11.6541025Z if scale_ub is not None: 2025-05-07T20:32:11.6541137Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6541284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6541374Z ) 2025-05-07T20:32:11.6541456Z else: 2025-05-07T20:32:11.6541558Z scale_ub_tensor = None 2025-05-07T20:32:11.6541645Z 2025-05-07T20:32:11.6541779Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6541882Z op = silu_mul_quant 2025-05-07T20:32:11.6541971Z if compiled: 2025-05-07T20:32:11.6542076Z op = torch.compile(op) 2025-05-07T20:32:11.6542194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6542271Z 2025-05-07T20:32:11.6542366Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6542370Z 2025-05-07T20:32:11.6542480Z moe/activation_test.py:117: 2025-05-07T20:32:11.6542621Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6542731Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6542843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6543227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6543333Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6543847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6543952Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6544335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6544569Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6544934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6545041Z kernel = self.compile( 2025-05-07T20:32:11.6545524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6545719Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6545855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6545860Z 2025-05-07T20:32:11.6546077Z self = 2025-05-07T20:32:11.6546891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6547419Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1992160>} 2025-05-07T20:32:11.6548272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6548474Z context = 2025-05-07T20:32:11.6548478Z 2025-05-07T20:32:11.6548658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6548936Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6549050Z module_map=module_map) 2025-05-07T20:32:11.6549225Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6549328Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6549409Z E ^ 2025-05-07T20:32:11.6549785Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6549796Z 2025-05-07T20:32:11.6550233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6550238Z 2025-05-07T20:32:11.6550352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6550586Z self=, 2025-05-07T20:32:11.6550667Z T=4096, 2025-05-07T20:32:11.6550759Z D=7168, 2025-05-07T20:32:11.6550845Z scale_ub=None, 2025-05-07T20:32:11.6550937Z contiguous=False, 2025-05-07T20:32:11.6551033Z compiled=True, 2025-05-07T20:32:11.6551111Z ) 2025-05-07T20:32:11.6551347Z self = 2025-05-07T20:32:11.6551532Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6551536Z 2025-05-07T20:32:11.6551616Z @given( 2025-05-07T20:32:11.6551753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6551870Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6551995Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6552127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6552247Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6552327Z ) 2025-05-07T20:32:11.6552591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6552692Z def test_silu_mul_quant( 2025-05-07T20:32:11.6552781Z self, 2025-05-07T20:32:11.6552862Z T: int, 2025-05-07T20:32:11.6552944Z D: int, 2025-05-07T20:32:11.6553056Z scale_ub: Optional[float], 2025-05-07T20:32:11.6553150Z contiguous: bool, 2025-05-07T20:32:11.6553243Z compiled: bool, 2025-05-07T20:32:11.6553334Z ) -> None: 2025-05-07T20:32:11.6553435Z torch.manual_seed(2025) 2025-05-07T20:32:11.6553513Z 2025-05-07T20:32:11.6553698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6553784Z 2025-05-07T20:32:11.6553997Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6554138Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6554234Z x = x_sign * x_clamp 2025-05-07T20:32:11.6554332Z x0 = x[:, :D] 2025-05-07T20:32:11.6554417Z x1 = x[:, D:] 2025-05-07T20:32:11.6554494Z 2025-05-07T20:32:11.6554590Z if contiguous: 2025-05-07T20:32:11.6554688Z x0 = x0.contiguous() 2025-05-07T20:32:11.6554783Z x1 = x1.contiguous() 2025-05-07T20:32:11.6554871Z 2025-05-07T20:32:11.6554966Z if scale_ub is not None: 2025-05-07T20:32:11.6555081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6555232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6555312Z ) 2025-05-07T20:32:11.6555393Z else: 2025-05-07T20:32:11.6555500Z scale_ub_tensor = None 2025-05-07T20:32:11.6555577Z 2025-05-07T20:32:11.6555792Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6555902Z op = silu_mul_quant 2025-05-07T20:32:11.6555991Z if compiled: 2025-05-07T20:32:11.6556105Z op = torch.compile(op) 2025-05-07T20:32:11.6556216Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6556293Z 2025-05-07T20:32:11.6556398Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6556403Z 2025-05-07T20:32:11.6556506Z moe/activation_test.py:117: 2025-05-07T20:32:11.6556644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6556759Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6556865Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6557254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6557354Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6557871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6557990Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6558363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6558600Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6558963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6559063Z kernel = self.compile( 2025-05-07T20:32:11.6559467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6559655Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6559792Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6559797Z 2025-05-07T20:32:11.6560021Z self = 2025-05-07T20:32:11.6560954Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6561487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1992e80>} 2025-05-07T20:32:11.6562259Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6562460Z context = 2025-05-07T20:32:11.6562470Z 2025-05-07T20:32:11.6562643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6563009Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6563136Z module_map=module_map) 2025-05-07T20:32:11.6563304Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6563409Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6563496Z E ^ 2025-05-07T20:32:11.6563867Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6563872Z 2025-05-07T20:32:11.6564315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6564319Z 2025-05-07T20:32:11.6564428Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6564661Z self=, 2025-05-07T20:32:11.6564749Z T=16384, 2025-05-07T20:32:11.6564829Z D=5120, 2025-05-07T20:32:11.6564995Z scale_ub=1200.0, 2025-05-07T20:32:11.6565097Z contiguous=False, 2025-05-07T20:32:11.6565189Z compiled=False, 2025-05-07T20:32:11.6565268Z ) 2025-05-07T20:32:11.6565504Z self = 2025-05-07T20:32:11.6565695Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.6565699Z 2025-05-07T20:32:11.6565786Z @given( 2025-05-07T20:32:11.6565914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6566019Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6566146Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6566269Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6566388Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6566473Z ) 2025-05-07T20:32:11.6566730Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6566834Z def test_silu_mul_quant( 2025-05-07T20:32:11.6566923Z self, 2025-05-07T20:32:11.6567009Z T: int, 2025-05-07T20:32:11.6567098Z D: int, 2025-05-07T20:32:11.6567205Z scale_ub: Optional[float], 2025-05-07T20:32:11.6567300Z contiguous: bool, 2025-05-07T20:32:11.6567402Z compiled: bool, 2025-05-07T20:32:11.6567485Z ) -> None: 2025-05-07T20:32:11.6567587Z torch.manual_seed(2025) 2025-05-07T20:32:11.6567671Z 2025-05-07T20:32:11.6567847Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6567924Z 2025-05-07T20:32:11.6568032Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6568164Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6568260Z x = x_sign * x_clamp 2025-05-07T20:32:11.6568357Z x0 = x[:, :D] 2025-05-07T20:32:11.6568442Z x1 = x[:, D:] 2025-05-07T20:32:11.6568532Z 2025-05-07T20:32:11.6568637Z if contiguous: 2025-05-07T20:32:11.6568749Z x0 = x0.contiguous() 2025-05-07T20:32:11.6568879Z x1 = x1.contiguous() 2025-05-07T20:32:11.6568957Z 2025-05-07T20:32:11.6569053Z if scale_ub is not None: 2025-05-07T20:32:11.6569172Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6569314Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6569395Z ) 2025-05-07T20:32:11.6569486Z else: 2025-05-07T20:32:11.6569585Z scale_ub_tensor = None 2025-05-07T20:32:11.6569662Z 2025-05-07T20:32:11.6569810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6569904Z op = silu_mul_quant 2025-05-07T20:32:11.6569998Z if compiled: 2025-05-07T20:32:11.6570101Z op = torch.compile(op) 2025-05-07T20:32:11.6570210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6570292Z 2025-05-07T20:32:11.6570386Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6570390Z 2025-05-07T20:32:11.6570496Z moe/activation_test.py:117: 2025-05-07T20:32:11.6570724Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6570831Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6570940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6571456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:11.6571557Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6571934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6572164Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6572517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6572622Z kernel = self.compile( 2025-05-07T20:32:11.6573022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6573297Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6573429Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6573433Z 2025-05-07T20:32:11.6573649Z self = 2025-05-07T20:32:11.6574457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6574982Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169c220>} 2025-05-07T20:32:11.6575761Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6575969Z context = 2025-05-07T20:32:11.6575973Z 2025-05-07T20:32:11.6576152Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6576432Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6576545Z module_map=module_map) 2025-05-07T20:32:11.6576717Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6576823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6576904Z E ^ 2025-05-07T20:32:11.6577279Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6577283Z 2025-05-07T20:32:11.6577712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6577722Z 2025-05-07T20:32:11.6577841Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6578073Z self=, 2025-05-07T20:32:11.6578155Z T=16384, 2025-05-07T20:32:11.6578245Z D=5120, 2025-05-07T20:32:11.6578333Z scale_ub=1200.0, 2025-05-07T20:32:11.6578423Z contiguous=True, 2025-05-07T20:32:11.6578518Z compiled=True, 2025-05-07T20:32:11.6578596Z ) 2025-05-07T20:32:11.6578825Z self = 2025-05-07T20:32:11.6579018Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6579022Z 2025-05-07T20:32:11.6579102Z @given( 2025-05-07T20:32:11.6579234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6579338Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6579458Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6579678Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6579800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6579877Z ) 2025-05-07T20:32:11.6580140Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6580238Z def test_silu_mul_quant( 2025-05-07T20:32:11.6580318Z self, 2025-05-07T20:32:11.6580408Z T: int, 2025-05-07T20:32:11.6580487Z D: int, 2025-05-07T20:32:11.6580596Z scale_ub: Optional[float], 2025-05-07T20:32:11.6580694Z contiguous: bool, 2025-05-07T20:32:11.6580787Z compiled: bool, 2025-05-07T20:32:11.6580875Z ) -> None: 2025-05-07T20:32:11.6580975Z torch.manual_seed(2025) 2025-05-07T20:32:11.6581051Z 2025-05-07T20:32:11.6581232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6581310Z 2025-05-07T20:32:11.6581406Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6581652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6581750Z x = x_sign * x_clamp 2025-05-07T20:32:11.6581835Z x0 = x[:, :D] 2025-05-07T20:32:11.6581926Z x1 = x[:, D:] 2025-05-07T20:32:11.6582003Z 2025-05-07T20:32:11.6582098Z if contiguous: 2025-05-07T20:32:11.6582195Z x0 = x0.contiguous() 2025-05-07T20:32:11.6582288Z x1 = x1.contiguous() 2025-05-07T20:32:11.6582369Z 2025-05-07T20:32:11.6582462Z if scale_ub is not None: 2025-05-07T20:32:11.6582572Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6582716Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6582795Z ) 2025-05-07T20:32:11.6582875Z else: 2025-05-07T20:32:11.6582978Z scale_ub_tensor = None 2025-05-07T20:32:11.6583057Z 2025-05-07T20:32:11.6583192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6583290Z op = silu_mul_quant 2025-05-07T20:32:11.6583384Z if compiled: 2025-05-07T20:32:11.6583500Z op = torch.compile(op) 2025-05-07T20:32:11.6583609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6583685Z 2025-05-07T20:32:11.6583784Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6583792Z 2025-05-07T20:32:11.6583895Z moe/activation_test.py:117: 2025-05-07T20:32:11.6584026Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6584138Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6584241Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6584621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6584726Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6585236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6585350Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6585724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6585956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6586317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6586415Z kernel = self.compile( 2025-05-07T20:32:11.6586816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6587000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6587131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6587137Z 2025-05-07T20:32:11.6587355Z self = 2025-05-07T20:32:11.6588248Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6588791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169d4e0>} 2025-05-07T20:32:11.6589560Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6589761Z context = 2025-05-07T20:32:11.6589766Z 2025-05-07T20:32:11.6589944Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6590220Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6590412Z module_map=module_map) 2025-05-07T20:32:11.6590587Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6590690Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6590777Z E ^ 2025-05-07T20:32:11.6591145Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6591150Z 2025-05-07T20:32:11.6591581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6591595Z 2025-05-07T20:32:11.6591703Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6591937Z self=, 2025-05-07T20:32:11.6592024Z T=16384, 2025-05-07T20:32:11.6592105Z D=5120, 2025-05-07T20:32:11.6592196Z scale_ub=None, 2025-05-07T20:32:11.6592297Z contiguous=False, 2025-05-07T20:32:11.6592389Z compiled=True, 2025-05-07T20:32:11.6592468Z ) 2025-05-07T20:32:11.6592707Z self = 2025-05-07T20:32:11.6592892Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6592896Z 2025-05-07T20:32:11.6592982Z @given( 2025-05-07T20:32:11.6593107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6593210Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6593336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6593460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6593578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6593663Z ) 2025-05-07T20:32:11.6593920Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6594019Z def test_silu_mul_quant( 2025-05-07T20:32:11.6594106Z self, 2025-05-07T20:32:11.6594191Z T: int, 2025-05-07T20:32:11.6594276Z D: int, 2025-05-07T20:32:11.6594390Z scale_ub: Optional[float], 2025-05-07T20:32:11.6594485Z contiguous: bool, 2025-05-07T20:32:11.6594581Z compiled: bool, 2025-05-07T20:32:11.6594664Z ) -> None: 2025-05-07T20:32:11.6594766Z torch.manual_seed(2025) 2025-05-07T20:32:11.6594849Z 2025-05-07T20:32:11.6595023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6595100Z 2025-05-07T20:32:11.6595203Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6595333Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6595427Z x = x_sign * x_clamp 2025-05-07T20:32:11.6595516Z x0 = x[:, :D] 2025-05-07T20:32:11.6595601Z x1 = x[:, D:] 2025-05-07T20:32:11.6595679Z 2025-05-07T20:32:11.6595774Z if contiguous: 2025-05-07T20:32:11.6595868Z x0 = x0.contiguous() 2025-05-07T20:32:11.6595970Z x1 = x1.contiguous() 2025-05-07T20:32:11.6596051Z 2025-05-07T20:32:11.6596147Z if scale_ub is not None: 2025-05-07T20:32:11.6596350Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6596492Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6596570Z ) 2025-05-07T20:32:11.6596655Z else: 2025-05-07T20:32:11.6596753Z scale_ub_tensor = None 2025-05-07T20:32:11.6596828Z 2025-05-07T20:32:11.6596975Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6597069Z op = silu_mul_quant 2025-05-07T20:32:11.6597159Z if compiled: 2025-05-07T20:32:11.6597269Z op = torch.compile(op) 2025-05-07T20:32:11.6597377Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6597461Z 2025-05-07T20:32:11.6597559Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6597563Z 2025-05-07T20:32:11.6597664Z moe/activation_test.py:117: 2025-05-07T20:32:11.6597801Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6597990Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6598098Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6598489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6598587Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6599098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6599206Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6599577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6599815Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6600286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6600389Z kernel = self.compile( 2025-05-07T20:32:11.6600801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6600984Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6601122Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6601127Z 2025-05-07T20:32:11.6601338Z self = 2025-05-07T20:32:11.6602138Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6602667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169e2a0>} 2025-05-07T20:32:11.6603440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6603651Z context = 2025-05-07T20:32:11.6603655Z 2025-05-07T20:32:11.6603828Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6604104Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6604223Z module_map=module_map) 2025-05-07T20:32:11.6604391Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6604501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6604582Z E ^ 2025-05-07T20:32:11.6604948Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6604953Z 2025-05-07T20:32:11.6605478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6605483Z 2025-05-07T20:32:11.6605592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6605832Z self=, 2025-05-07T20:32:11.6605912Z T=2048, 2025-05-07T20:32:11.6605994Z D=5120, 2025-05-07T20:32:11.6606085Z scale_ub=None, 2025-05-07T20:32:11.6606176Z contiguous=False, 2025-05-07T20:32:11.6606262Z compiled=True, 2025-05-07T20:32:11.6606344Z ) 2025-05-07T20:32:11.6606571Z self = 2025-05-07T20:32:11.6606753Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.6606757Z 2025-05-07T20:32:11.6606841Z @given( 2025-05-07T20:32:11.6606965Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6607075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6607276Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6607402Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6607528Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6607607Z ) 2025-05-07T20:32:11.6607864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6607969Z def test_silu_mul_quant( 2025-05-07T20:32:11.6608050Z self, 2025-05-07T20:32:11.6608130Z T: int, 2025-05-07T20:32:11.6608216Z D: int, 2025-05-07T20:32:11.6608317Z scale_ub: Optional[float], 2025-05-07T20:32:11.6608409Z contiguous: bool, 2025-05-07T20:32:11.6608506Z compiled: bool, 2025-05-07T20:32:11.6608589Z ) -> None: 2025-05-07T20:32:11.6608694Z torch.manual_seed(2025) 2025-05-07T20:32:11.6608771Z 2025-05-07T20:32:11.6608950Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6609039Z 2025-05-07T20:32:11.6609135Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6609268Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6609367Z x = x_sign * x_clamp 2025-05-07T20:32:11.6609451Z x0 = x[:, :D] 2025-05-07T20:32:11.6609534Z x1 = x[:, D:] 2025-05-07T20:32:11.6609620Z 2025-05-07T20:32:11.6609710Z if contiguous: 2025-05-07T20:32:11.6609804Z x0 = x0.contiguous() 2025-05-07T20:32:11.6609903Z x1 = x1.contiguous() 2025-05-07T20:32:11.6609980Z 2025-05-07T20:32:11.6610075Z if scale_ub is not None: 2025-05-07T20:32:11.6610193Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6610334Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6610421Z ) 2025-05-07T20:32:11.6610504Z else: 2025-05-07T20:32:11.6610605Z scale_ub_tensor = None 2025-05-07T20:32:11.6610691Z 2025-05-07T20:32:11.6610826Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6610925Z op = silu_mul_quant 2025-05-07T20:32:11.6611027Z if compiled: 2025-05-07T20:32:11.6611131Z op = torch.compile(op) 2025-05-07T20:32:11.6611240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6611324Z 2025-05-07T20:32:11.6611421Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6611425Z 2025-05-07T20:32:11.6611536Z moe/activation_test.py:117: 2025-05-07T20:32:11.6611671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6611775Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6611886Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6612269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6612368Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6612887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6613080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6613776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6614061Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6614418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6614522Z kernel = self.compile( 2025-05-07T20:32:11.6614918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6615101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6615242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6615246Z 2025-05-07T20:32:11.6615459Z self = 2025-05-07T20:32:11.6616537Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6617061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef169f560>} 2025-05-07T20:32:11.6617832Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6618032Z context = 2025-05-07T20:32:11.6618037Z 2025-05-07T20:32:11.6618208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6618490Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6618611Z module_map=module_map) 2025-05-07T20:32:11.6618785Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6618888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6618968Z E ^ 2025-05-07T20:32:11.6619341Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6619346Z 2025-05-07T20:32:11.6619775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6619779Z 2025-05-07T20:32:11.6619888Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6620123Z self=, 2025-05-07T20:32:11.6620204Z T=2048, 2025-05-07T20:32:11.6620290Z D=5120, 2025-05-07T20:32:11.6620377Z scale_ub=1200.0, 2025-05-07T20:32:11.6620475Z contiguous=False, 2025-05-07T20:32:11.6620568Z compiled=True, 2025-05-07T20:32:11.6620652Z ) 2025-05-07T20:32:11.6620880Z self = 2025-05-07T20:32:11.6621070Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.6621074Z 2025-05-07T20:32:11.6621154Z @given( 2025-05-07T20:32:11.6621277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6621386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6621506Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6621633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6621751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6621830Z ) 2025-05-07T20:32:11.6622091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6622189Z def test_silu_mul_quant( 2025-05-07T20:32:11.6622274Z self, 2025-05-07T20:32:11.6622360Z T: int, 2025-05-07T20:32:11.6622576Z D: int, 2025-05-07T20:32:11.6622682Z scale_ub: Optional[float], 2025-05-07T20:32:11.6622781Z contiguous: bool, 2025-05-07T20:32:11.6622870Z compiled: bool, 2025-05-07T20:32:11.6622954Z ) -> None: 2025-05-07T20:32:11.6623057Z torch.manual_seed(2025) 2025-05-07T20:32:11.6623134Z 2025-05-07T20:32:11.6623319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6623398Z 2025-05-07T20:32:11.6623495Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6623633Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6623725Z x = x_sign * x_clamp 2025-05-07T20:32:11.6623809Z x0 = x[:, :D] 2025-05-07T20:32:11.6623899Z x1 = x[:, D:] 2025-05-07T20:32:11.6623975Z 2025-05-07T20:32:11.6624062Z if contiguous: 2025-05-07T20:32:11.6624166Z x0 = x0.contiguous() 2025-05-07T20:32:11.6624337Z x1 = x1.contiguous() 2025-05-07T20:32:11.6624414Z 2025-05-07T20:32:11.6624524Z if scale_ub is not None: 2025-05-07T20:32:11.6624635Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6624788Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6624868Z ) 2025-05-07T20:32:11.6624948Z else: 2025-05-07T20:32:11.6625055Z scale_ub_tensor = None 2025-05-07T20:32:11.6625131Z 2025-05-07T20:32:11.6625268Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6625374Z op = silu_mul_quant 2025-05-07T20:32:11.6625462Z if compiled: 2025-05-07T20:32:11.6625567Z op = torch.compile(op) 2025-05-07T20:32:11.6625687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6625763Z 2025-05-07T20:32:11.6625857Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6625862Z 2025-05-07T20:32:11.6625971Z moe/activation_test.py:117: 2025-05-07T20:32:11.6626118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6626232Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6626337Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6626719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6626822Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6627339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6627443Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6627814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6628055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6628408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6628519Z kernel = self.compile( 2025-05-07T20:32:11.6628919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6629102Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6629241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6629246Z 2025-05-07T20:32:11.6629459Z self = 2025-05-07T20:32:11.6630265Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6630788Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1534c20>} 2025-05-07T20:32:11.6632221Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6632438Z context = 2025-05-07T20:32:11.6632443Z 2025-05-07T20:32:11.6632615Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6632896Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6633009Z module_map=module_map) 2025-05-07T20:32:11.6633178Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6633286Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6633367Z E ^ 2025-05-07T20:32:11.6633741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6633822Z 2025-05-07T20:32:11.6634261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6634266Z 2025-05-07T20:32:11.6634375Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6634612Z self=, 2025-05-07T20:32:11.6634693Z T=4096, 2025-05-07T20:32:11.6634771Z D=5120, 2025-05-07T20:32:11.6634865Z scale_ub=1200.0, 2025-05-07T20:32:11.6634952Z contiguous=True, 2025-05-07T20:32:11.6635045Z compiled=True, 2025-05-07T20:32:11.6635125Z ) 2025-05-07T20:32:11.6635352Z self = 2025-05-07T20:32:11.6635537Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6635541Z 2025-05-07T20:32:11.6635620Z @given( 2025-05-07T20:32:11.6635743Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6635860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6635986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6636107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6636233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6636312Z ) 2025-05-07T20:32:11.6636572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6636670Z def test_silu_mul_quant( 2025-05-07T20:32:11.6636750Z self, 2025-05-07T20:32:11.6636836Z T: int, 2025-05-07T20:32:11.6636915Z D: int, 2025-05-07T20:32:11.6637020Z scale_ub: Optional[float], 2025-05-07T20:32:11.6637119Z contiguous: bool, 2025-05-07T20:32:11.6637209Z compiled: bool, 2025-05-07T20:32:11.6637290Z ) -> None: 2025-05-07T20:32:11.6637395Z torch.manual_seed(2025) 2025-05-07T20:32:11.6637475Z 2025-05-07T20:32:11.6637651Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6637740Z 2025-05-07T20:32:11.6637840Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6637981Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6638074Z x = x_sign * x_clamp 2025-05-07T20:32:11.6638157Z x0 = x[:, :D] 2025-05-07T20:32:11.6638245Z x1 = x[:, D:] 2025-05-07T20:32:11.6638322Z 2025-05-07T20:32:11.6638409Z if contiguous: 2025-05-07T20:32:11.6638510Z x0 = x0.contiguous() 2025-05-07T20:32:11.6638604Z x1 = x1.contiguous() 2025-05-07T20:32:11.6638679Z 2025-05-07T20:32:11.6638779Z if scale_ub is not None: 2025-05-07T20:32:11.6638888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6639052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6639130Z ) 2025-05-07T20:32:11.6639218Z else: 2025-05-07T20:32:11.6639317Z scale_ub_tensor = None 2025-05-07T20:32:11.6639403Z 2025-05-07T20:32:11.6639701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6639799Z op = silu_mul_quant 2025-05-07T20:32:11.6639887Z if compiled: 2025-05-07T20:32:11.6639999Z op = torch.compile(op) 2025-05-07T20:32:11.6640225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6640302Z 2025-05-07T20:32:11.6640403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6640408Z 2025-05-07T20:32:11.6640509Z moe/activation_test.py:117: 2025-05-07T20:32:11.6640656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6640765Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6640869Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6641259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6641357Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6641875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6642064Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6642437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6642680Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6643036Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6643135Z kernel = self.compile( 2025-05-07T20:32:11.6643545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6643731Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6643864Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6643876Z 2025-05-07T20:32:11.6644093Z self = 2025-05-07T20:32:11.6644901Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6645430Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1535a80>} 2025-05-07T20:32:11.6646198Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6646405Z context = 2025-05-07T20:32:11.6646409Z 2025-05-07T20:32:11.6646579Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6654204Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6654354Z module_map=module_map) 2025-05-07T20:32:11.6654530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6654635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6654724Z E ^ 2025-05-07T20:32:11.6655102Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6655107Z 2025-05-07T20:32:11.6655548Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6655562Z 2025-05-07T20:32:11.6655672Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6655908Z self=, 2025-05-07T20:32:11.6656000Z T=128, 2025-05-07T20:32:11.6656088Z D=5120, 2025-05-07T20:32:11.6656178Z scale_ub=1200.0, 2025-05-07T20:32:11.6656415Z contiguous=False, 2025-05-07T20:32:11.6656506Z compiled=True, 2025-05-07T20:32:11.6656586Z ) 2025-05-07T20:32:11.6656824Z self = 2025-05-07T20:32:11.6657006Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:11.6657011Z 2025-05-07T20:32:11.6657100Z @given( 2025-05-07T20:32:11.6657227Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6657333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6657463Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6657587Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6657712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6657801Z ) 2025-05-07T20:32:11.6658059Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6658237Z def test_silu_mul_quant( 2025-05-07T20:32:11.6658328Z self, 2025-05-07T20:32:11.6658414Z T: int, 2025-05-07T20:32:11.6658495Z D: int, 2025-05-07T20:32:11.6658607Z scale_ub: Optional[float], 2025-05-07T20:32:11.6658712Z contiguous: bool, 2025-05-07T20:32:11.6658825Z compiled: bool, 2025-05-07T20:32:11.6658925Z ) -> None: 2025-05-07T20:32:11.6659035Z torch.manual_seed(2025) 2025-05-07T20:32:11.6659121Z 2025-05-07T20:32:11.6659299Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6659378Z 2025-05-07T20:32:11.6659485Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6659616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6659712Z x = x_sign * x_clamp 2025-05-07T20:32:11.6659810Z x0 = x[:, :D] 2025-05-07T20:32:11.6659895Z x1 = x[:, D:] 2025-05-07T20:32:11.6659974Z 2025-05-07T20:32:11.6660074Z if contiguous: 2025-05-07T20:32:11.6660180Z x0 = x0.contiguous() 2025-05-07T20:32:11.6660289Z x1 = x1.contiguous() 2025-05-07T20:32:11.6660369Z 2025-05-07T20:32:11.6660465Z if scale_ub is not None: 2025-05-07T20:32:11.6660584Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6660727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6660808Z ) 2025-05-07T20:32:11.6660897Z else: 2025-05-07T20:32:11.6660995Z scale_ub_tensor = None 2025-05-07T20:32:11.6661072Z 2025-05-07T20:32:11.6661213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6661309Z op = silu_mul_quant 2025-05-07T20:32:11.6661399Z if compiled: 2025-05-07T20:32:11.6661512Z op = torch.compile(op) 2025-05-07T20:32:11.6661623Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6661699Z 2025-05-07T20:32:11.6661802Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6661812Z 2025-05-07T20:32:11.6661914Z moe/activation_test.py:117: 2025-05-07T20:32:11.6662062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6662167Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6662274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6662671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6662770Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.6663285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6663396Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6663769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6664012Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6664455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6664556Z kernel = self.compile( 2025-05-07T20:32:11.6664963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6665148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6665287Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6665292Z 2025-05-07T20:32:11.6665505Z self = 2025-05-07T20:32:11.6666314Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6666855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef1536ca0>} 2025-05-07T20:32:11.6667701Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6667903Z context = 2025-05-07T20:32:11.6667908Z 2025-05-07T20:32:11.6668077Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6668349Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6668467Z module_map=module_map) 2025-05-07T20:32:11.6668635Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6668747Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6668828Z E ^ 2025-05-07T20:32:11.6669200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6669215Z 2025-05-07T20:32:11.6669654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6669659Z 2025-05-07T20:32:11.6669768Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6670007Z self=, 2025-05-07T20:32:11.6670089Z T=16384, 2025-05-07T20:32:11.6670171Z D=7168, 2025-05-07T20:32:11.6670266Z scale_ub=1200.0, 2025-05-07T20:32:11.6670355Z contiguous=True, 2025-05-07T20:32:11.6670443Z compiled=True, 2025-05-07T20:32:11.6670528Z ) 2025-05-07T20:32:11.6670758Z self = 2025-05-07T20:32:11.6670942Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.6670947Z 2025-05-07T20:32:11.6671034Z @given( 2025-05-07T20:32:11.6671164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6671286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6671407Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6671530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6671658Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6671738Z ) 2025-05-07T20:32:11.6671995Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6672103Z def test_silu_mul_quant( 2025-05-07T20:32:11.6672185Z self, 2025-05-07T20:32:11.6672270Z T: int, 2025-05-07T20:32:11.6672359Z D: int, 2025-05-07T20:32:11.6672464Z scale_ub: Optional[float], 2025-05-07T20:32:11.6672558Z contiguous: bool, 2025-05-07T20:32:11.6672661Z compiled: bool, 2025-05-07T20:32:11.6672744Z ) -> None: 2025-05-07T20:32:11.6672857Z torch.manual_seed(2025) 2025-05-07T20:32:11.6672940Z 2025-05-07T20:32:11.6673200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6673286Z 2025-05-07T20:32:11.6673385Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6673520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6673624Z x = x_sign * x_clamp 2025-05-07T20:32:11.6673709Z x0 = x[:, :D] 2025-05-07T20:32:11.6673795Z x1 = x[:, D:] 2025-05-07T20:32:11.6673882Z 2025-05-07T20:32:11.6673975Z if contiguous: 2025-05-07T20:32:11.6674074Z x0 = x0.contiguous() 2025-05-07T20:32:11.6674178Z x1 = x1.contiguous() 2025-05-07T20:32:11.6674257Z 2025-05-07T20:32:11.6674354Z if scale_ub is not None: 2025-05-07T20:32:11.6674472Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6674615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6674707Z ) 2025-05-07T20:32:11.6674788Z else: 2025-05-07T20:32:11.6674965Z scale_ub_tensor = None 2025-05-07T20:32:11.6675051Z 2025-05-07T20:32:11.6675197Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6675292Z op = silu_mul_quant 2025-05-07T20:32:11.6675390Z if compiled: 2025-05-07T20:32:11.6675497Z op = torch.compile(op) 2025-05-07T20:32:11.6675613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6675697Z 2025-05-07T20:32:11.6675793Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6675797Z 2025-05-07T20:32:11.6675907Z moe/activation_test.py:117: 2025-05-07T20:32:11.6676041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6676146Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6676258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6676639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.6676743Z return fn(*args, **kwargs) 
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError: type fp8e4nv not supported in this architecture
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)): tried to allocate 320.00 MiB; 144.44 MiB free of 22.07 GiB, 21.60 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 112.00 MiB; 32.44 MiB free, 21.61 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:92 (x = torch.randn([T, 2 * D], ...)): tried to allocate 448.00 MiB; 144.44 MiB free, 21.50 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
  -> torch.OutOfMemoryError at moe/activation_test.py:95 (x_clamp): tried to allocate 56.00 MiB; 32.44 MiB free, 21.67 GiB allocated by PyTorch

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
  -> torch.OutOfMemoryError at moe/activation_test.py:94 (x_sign = torch.sign(x)): tried to allocate 56.00 MiB; 32.44 MiB free, 21.67 GiB allocated by PyTorch

Each OutOfMemoryError message ends with the same allocator hint: "If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)"
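The OutOfMemoryError examples are consistent with the requested shapes rather than with any single bad example: each example allocates a [T, 2 * D] bfloat16 input plus several same-sized temporaries, and by this point roughly 21.5 to 21.7 GiB of the 22.07 GiB device is already held by PyTorch. The reported sizes match the shape arithmetic exactly; below is a worked check for the largest request above, plus a common mitigation between examples (the helper sketches a general technique, not something the test currently does):

    import torch

    # Size check: a [16384, 2 * 7168] bfloat16 tensor takes
    # 16384 * 14336 elements * 2 bytes = 469,762,048 bytes = 448 MiB,
    # matching "Tried to allocate 448.00 MiB" in the log above.
    assert 16384 * (2 * 7168) * 2 == 448 * 2**20

    def release_cached_cuda_memory() -> None:
        # Return freed-but-cached allocator blocks to the driver so one
        # example's peak does not starve the next; the log's own hint,
        # PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, targets the
        # related fragmentation problem.
        if torch.cuda.is_available():
            torch.cuda.synchronize()
            torch.cuda.empty_cache()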
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6862744Z 2025-05-07T20:32:11.6862883Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.6863106Z 2025-05-07T20:32:11.6863212Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6863645Z self=, 2025-05-07T20:32:11.6864065Z T=1, 2025-05-07T20:32:11.6864260Z D=7168, 2025-05-07T20:32:11.6864457Z scale_ub=1200.0, 2025-05-07T20:32:11.6864691Z contiguous=True, 2025-05-07T20:32:11.6864924Z compiled=False, 2025-05-07T20:32:11.6865135Z ) 2025-05-07T20:32:11.6865469Z self = 2025-05-07T20:32:11.6865980Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6866262Z 2025-05-07T20:32:11.6866348Z @given( 2025-05-07T20:32:11.6866588Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6866917Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6867234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6867582Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6867930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6868229Z ) 2025-05-07T20:32:11.6868590Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6869102Z def test_silu_mul_quant( 2025-05-07T20:32:11.6869363Z self, 2025-05-07T20:32:11.6869565Z T: int, 2025-05-07T20:32:11.6869769Z D: int, 2025-05-07T20:32:11.6869999Z scale_ub: Optional[float], 2025-05-07T20:32:11.6870275Z contiguous: bool, 2025-05-07T20:32:11.6870525Z compiled: bool, 2025-05-07T20:32:11.6870764Z ) -> None: 2025-05-07T20:32:11.6871199Z torch.manual_seed(2025) 2025-05-07T20:32:11.6871458Z 2025-05-07T20:32:11.6871741Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6872092Z 2025-05-07T20:32:11.6872298Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6872603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6872921Z x = x_sign * x_clamp 2025-05-07T20:32:11.6873173Z x0 = x[:, :D] 2025-05-07T20:32:11.6873401Z x1 = x[:, D:] 2025-05-07T20:32:11.6873620Z 2025-05-07T20:32:11.6873808Z if contiguous: 2025-05-07T20:32:11.6874052Z x0 = x0.contiguous() 2025-05-07T20:32:11.6874323Z x1 = x1.contiguous() 2025-05-07T20:32:11.6874570Z 2025-05-07T20:32:11.6874771Z if scale_ub is not None: 2025-05-07T20:32:11.6875060Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6875407Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6875851Z ) 2025-05-07T20:32:11.6876061Z else: 2025-05-07T20:32:11.6876276Z scale_ub_tensor = None 2025-05-07T20:32:11.6876541Z 2025-05-07T20:32:11.6876784Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6877107Z op = silu_mul_quant 2025-05-07T20:32:11.6877372Z if compiled: 2025-05-07T20:32:11.6877634Z op = torch.compile(op) 2025-05-07T20:32:11.6877940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6878234Z 2025-05-07T20:32:11.6878442Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6878614Z 2025-05-07T20:32:11.6878728Z moe/activation_test.py:117: 2025-05-07T20:32:11.6879057Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6879430Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6879727Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6880520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6881250Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6881813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6882527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6883216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6883773Z kernel = self.compile( 2025-05-07T20:32:11.6884341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6885022Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6885440Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6885682Z 2025-05-07T20:32:11.6885902Z self = 2025-05-07T20:32:11.6887028Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6888451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c4400>} 2025-05-07T20:32:11.6889842Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6890905Z context = 2025-05-07T20:32:11.6891209Z 2025-05-07T20:32:11.6891389Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6892030Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6892518Z module_map=module_map) 2025-05-07T20:32:11.6892904Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6893276Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6893542Z E ^ 2025-05-07T20:32:11.6894028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6894497Z 2025-05-07T20:32:11.6894938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6895470Z 2025-05-07T20:32:11.6895585Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6896014Z self=, 2025-05-07T20:32:11.6896434Z T=128, 2025-05-07T20:32:11.6896634Z D=5120, 2025-05-07T20:32:11.6896910Z scale_ub=None, 2025-05-07T20:32:11.6897139Z contiguous=True, 2025-05-07T20:32:11.6897372Z compiled=False, 2025-05-07T20:32:11.6897579Z ) 2025-05-07T20:32:11.6897911Z self = 2025-05-07T20:32:11.6898424Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6898703Z 2025-05-07T20:32:11.6898786Z @given( 2025-05-07T20:32:11.6899021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6899347Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6899667Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6900006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6900350Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6900650Z ) 2025-05-07T20:32:11.6901014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6901482Z def test_silu_mul_quant( 2025-05-07T20:32:11.6901737Z self, 2025-05-07T20:32:11.6901943Z T: int, 2025-05-07T20:32:11.6902152Z D: int, 2025-05-07T20:32:11.6902384Z scale_ub: Optional[float], 2025-05-07T20:32:11.6902663Z contiguous: bool, 2025-05-07T20:32:11.6902916Z compiled: bool, 2025-05-07T20:32:11.6903150Z ) -> None: 2025-05-07T20:32:11.6903373Z torch.manual_seed(2025) 2025-05-07T20:32:11.6903624Z 2025-05-07T20:32:11.6903907Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6904264Z 2025-05-07T20:32:11.6904463Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6904772Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6905096Z x = x_sign * x_clamp 2025-05-07T20:32:11.6905340Z x0 = x[:, :D] 2025-05-07T20:32:11.6905568Z x1 = x[:, D:] 2025-05-07T20:32:11.6905785Z 2025-05-07T20:32:11.6905973Z if contiguous: 2025-05-07T20:32:11.6906222Z x0 = x0.contiguous() 2025-05-07T20:32:11.6906499Z x1 = x1.contiguous() 2025-05-07T20:32:11.6906745Z 2025-05-07T20:32:11.6906945Z if scale_ub is not None: 2025-05-07T20:32:11.6907235Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6907580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6907905Z ) 2025-05-07T20:32:11.6908109Z else: 2025-05-07T20:32:11.6908324Z scale_ub_tensor = None 2025-05-07T20:32:11.6908587Z 2025-05-07T20:32:11.6908829Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6909164Z op = silu_mul_quant 2025-05-07T20:32:11.6909419Z if compiled: 2025-05-07T20:32:11.6909678Z op = torch.compile(op) 2025-05-07T20:32:11.6909988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6910268Z 2025-05-07T20:32:11.6910469Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6910640Z 2025-05-07T20:32:11.6910755Z moe/activation_test.py:117: 2025-05-07T20:32:11.6911149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6911501Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6911794Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6912516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6913226Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6914133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6914853Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6915542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6916098Z kernel = self.compile( 2025-05-07T20:32:11.6916666Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6917585Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6917998Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6918242Z 2025-05-07T20:32:11.6918459Z self = 2025-05-07T20:32:11.6919585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6921131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c5300>} 2025-05-07T20:32:11.6922529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6923598Z context = 2025-05-07T20:32:11.6923907Z 2025-05-07T20:32:11.6924083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6924630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6925114Z module_map=module_map) 2025-05-07T20:32:11.6925497Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6925867Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6926139Z E ^ 2025-05-07T20:32:11.6926622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6927096Z 2025-05-07T20:32:11.6927529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6928066Z 2025-05-07T20:32:11.6928189Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6928612Z self=, 2025-05-07T20:32:11.6929029Z T=128, 2025-05-07T20:32:11.6929223Z D=7168, 2025-05-07T20:32:11.6929420Z scale_ub=None, 2025-05-07T20:32:11.6929639Z contiguous=True, 2025-05-07T20:32:11.6929868Z compiled=False, 2025-05-07T20:32:11.6930076Z ) 2025-05-07T20:32:11.6930409Z self = 2025-05-07T20:32:11.6930918Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6931194Z 2025-05-07T20:32:11.6931271Z @given( 2025-05-07T20:32:11.6931508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6931833Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6932152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6932495Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6932993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6933295Z ) 2025-05-07T20:32:11.6933650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6934106Z def test_silu_mul_quant( 2025-05-07T20:32:11.6934357Z self, 2025-05-07T20:32:11.6934549Z T: int, 2025-05-07T20:32:11.6934749Z D: int, 2025-05-07T20:32:11.6934972Z scale_ub: Optional[float], 2025-05-07T20:32:11.6935249Z contiguous: bool, 2025-05-07T20:32:11.6935494Z compiled: bool, 2025-05-07T20:32:11.6935723Z ) -> None: 2025-05-07T20:32:11.6935937Z torch.manual_seed(2025) 2025-05-07T20:32:11.6936185Z 2025-05-07T20:32:11.6936464Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6936817Z 2025-05-07T20:32:11.6937011Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6937404Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6937733Z x = x_sign * x_clamp 2025-05-07T20:32:11.6937977Z x0 = x[:, :D] 2025-05-07T20:32:11.6938201Z x1 = x[:, D:] 2025-05-07T20:32:11.6938412Z 2025-05-07T20:32:11.6938596Z if contiguous: 2025-05-07T20:32:11.6938833Z x0 = x0.contiguous() 2025-05-07T20:32:11.6939100Z x1 = x1.contiguous() 2025-05-07T20:32:11.6939341Z 2025-05-07T20:32:11.6939559Z if scale_ub is not None: 2025-05-07T20:32:11.6939843Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6940189Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6940510Z ) 2025-05-07T20:32:11.6940703Z else: 2025-05-07T20:32:11.6940921Z scale_ub_tensor = None 2025-05-07T20:32:11.6941184Z 2025-05-07T20:32:11.6941418Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6941743Z op = silu_mul_quant 2025-05-07T20:32:11.6942010Z if compiled: 2025-05-07T20:32:11.6942118Z op = torch.compile(op) 2025-05-07T20:32:11.6942227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6942309Z 2025-05-07T20:32:11.6942406Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6942411Z 2025-05-07T20:32:11.6942513Z moe/activation_test.py:117: 2025-05-07T20:32:11.6942651Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6942754Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6942864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6943381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6943482Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6943860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6944101Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6944457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6944560Z kernel = self.compile( 2025-05-07T20:32:11.6944955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6945142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6945272Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6945277Z 2025-05-07T20:32:11.6945486Z self = 2025-05-07T20:32:11.6946487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6947137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c60c0>} 2025-05-07T20:32:11.6947965Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6956803Z context = 2025-05-07T20:32:11.6956814Z 2025-05-07T20:32:11.6957002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6957292Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6957409Z module_map=module_map) 2025-05-07T20:32:11.6957582Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6957862Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6957947Z E ^ 2025-05-07T20:32:11.6958329Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6958334Z 2025-05-07T20:32:11.6958825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6958830Z 2025-05-07T20:32:11.6958947Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6959190Z self=, 2025-05-07T20:32:11.6959273Z T=2048, 2025-05-07T20:32:11.6959359Z D=7168, 2025-05-07T20:32:11.6959456Z scale_ub=1200.0, 2025-05-07T20:32:11.6959545Z contiguous=True, 2025-05-07T20:32:11.6959633Z compiled=False, 2025-05-07T20:32:11.6959720Z ) 2025-05-07T20:32:11.6959950Z self = 2025-05-07T20:32:11.6960245Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6960265Z 2025-05-07T20:32:11.6960353Z @given( 2025-05-07T20:32:11.6960482Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6960596Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6960718Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6960842Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6960972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6961052Z ) 2025-05-07T20:32:11.6961312Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6961420Z def test_silu_mul_quant( 2025-05-07T20:32:11.6961502Z self, 2025-05-07T20:32:11.6961584Z T: int, 2025-05-07T20:32:11.6961673Z D: int, 2025-05-07T20:32:11.6961778Z scale_ub: Optional[float], 2025-05-07T20:32:11.6961883Z contiguous: bool, 2025-05-07T20:32:11.6961976Z compiled: bool, 2025-05-07T20:32:11.6962068Z ) -> None: 2025-05-07T20:32:11.6962182Z torch.manual_seed(2025) 2025-05-07T20:32:11.6962261Z 2025-05-07T20:32:11.6962440Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6964309Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
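[annotation] The CompilationError repeated above is an architecture mismatch, not a code bug: Triton's fp8e4nv dtype corresponds to torch.float8_e4m3fn and requires compute capability (8, 9) or newer (Ada/Hopper); a GPU that only exposes fp8e4b15 and fp8e5, as the error message states, reports a lower capability. A minimal guard one could add to the test module (hypothetical; not part of activation_test.py) looks like:

    import torch

    def supports_fp8e4nv() -> bool:
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; GPUs that only expose
        # fp8e4b15/fp8e5, like the one in this run, report below (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)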
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6964315Z 2025-05-07T20:32:11.6964440Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6964445Z 2025-05-07T20:32:11.6964562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6964894Z self=, 2025-05-07T20:32:11.6964977Z T=1, 2025-05-07T20:32:11.6965064Z D=5120, 2025-05-07T20:32:11.6965152Z scale_ub=1200.0, 2025-05-07T20:32:11.6965246Z contiguous=True, 2025-05-07T20:32:11.6965333Z compiled=False, 2025-05-07T20:32:11.6965409Z ) 2025-05-07T20:32:11.6965642Z self = 2025-05-07T20:32:11.6965815Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.6965819Z 2025-05-07T20:32:11.6965899Z @given( 2025-05-07T20:32:11.6966030Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6966132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6966253Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6966387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6966505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6966720Z ) 2025-05-07T20:32:11.6966981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6967079Z def test_silu_mul_quant( 2025-05-07T20:32:11.6967166Z self, 2025-05-07T20:32:11.6967246Z T: int, 2025-05-07T20:32:11.6967326Z D: int, 2025-05-07T20:32:11.6967436Z scale_ub: Optional[float], 2025-05-07T20:32:11.6967530Z contiguous: bool, 2025-05-07T20:32:11.6967620Z compiled: bool, 2025-05-07T20:32:11.6967709Z ) -> None: 2025-05-07T20:32:11.6967808Z torch.manual_seed(2025) 2025-05-07T20:32:11.6967884Z 2025-05-07T20:32:11.6968068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6968148Z 2025-05-07T20:32:11.6968252Z x_sign = torch.sign(x) 2025-05-07T20:32:11.6968383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.6968474Z x = x_sign * x_clamp 2025-05-07T20:32:11.6968570Z x0 = x[:, :D] 2025-05-07T20:32:11.6968657Z x1 = x[:, D:] 2025-05-07T20:32:11.6968733Z 2025-05-07T20:32:11.6968830Z if contiguous: 2025-05-07T20:32:11.6968925Z x0 = x0.contiguous() 2025-05-07T20:32:11.6969019Z x1 = x1.contiguous() 2025-05-07T20:32:11.6969100Z 2025-05-07T20:32:11.6969194Z if scale_ub is not None: 2025-05-07T20:32:11.6969304Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.6969453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.6969531Z ) 2025-05-07T20:32:11.6969618Z else: 2025-05-07T20:32:11.6969716Z scale_ub_tensor = None 2025-05-07T20:32:11.6969790Z 2025-05-07T20:32:11.6969935Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.6970032Z op = silu_mul_quant 2025-05-07T20:32:11.6970122Z if compiled: 2025-05-07T20:32:11.6970231Z op = torch.compile(op) 2025-05-07T20:32:11.6970348Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6970428Z 2025-05-07T20:32:11.6970529Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.6970533Z 2025-05-07T20:32:11.6970635Z moe/activation_test.py:117: 2025-05-07T20:32:11.6970772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6970885Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.6970990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.6971517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.6971620Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.6971996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.6972240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.6972689Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.6972800Z kernel = self.compile( 2025-05-07T20:32:11.6973200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.6973383Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.6973522Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.6973526Z 2025-05-07T20:32:11.6973739Z self = 2025-05-07T20:32:11.6974544Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.6975084Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef17c76a0>} 2025-05-07T20:32:11.6975937Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.6976147Z context = 2025-05-07T20:32:11.6976151Z 2025-05-07T20:32:11.6976326Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.6976609Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.6976724Z module_map=module_map) 2025-05-07T20:32:11.6976891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.6977003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.6977083Z E ^ 2025-05-07T20:32:11.6977460Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.6977475Z 2025-05-07T20:32:11.6977915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.6977920Z 2025-05-07T20:32:11.6978028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6978269Z self=, 2025-05-07T20:32:11.6978354Z T=2048, 2025-05-07T20:32:11.6978434Z D=5120, 2025-05-07T20:32:11.6978534Z scale_ub=None, 2025-05-07T20:32:11.6978642Z contiguous=True, 2025-05-07T20:32:11.6978739Z compiled=False, 2025-05-07T20:32:11.6978840Z ) 2025-05-07T20:32:11.6979067Z self = 2025-05-07T20:32:11.6979260Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6979264Z 2025-05-07T20:32:11.6979344Z @given( 2025-05-07T20:32:11.6979474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6979592Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6979711Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6979833Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6979958Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6980036Z ) 2025-05-07T20:32:11.6980293Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6980399Z def test_silu_mul_quant( 2025-05-07T20:32:11.6980479Z self, 2025-05-07T20:32:11.6980565Z T: int, 2025-05-07T20:32:11.6980645Z D: int, 2025-05-07T20:32:11.6980746Z scale_ub: Optional[float], 2025-05-07T20:32:11.6980847Z contiguous: bool, 2025-05-07T20:32:11.6980935Z compiled: bool, 2025-05-07T20:32:11.6981021Z ) -> None: 2025-05-07T20:32:11.6981127Z torch.manual_seed(2025) 2025-05-07T20:32:11.6981208Z 2025-05-07T20:32:11.6981474Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6981563Z 2025-05-07T20:32:11.6981659Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.6983519Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6983525Z 2025-05-07T20:32:11.6983648Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.6983653Z 2025-05-07T20:32:11.6983764Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6984093Z self=, 2025-05-07T20:32:11.6984175Z T=16384, 2025-05-07T20:32:11.6984266Z D=5120, 2025-05-07T20:32:11.6984350Z scale_ub=None, 2025-05-07T20:32:11.6984438Z contiguous=True, 2025-05-07T20:32:11.6984532Z compiled=False, 2025-05-07T20:32:11.6984608Z ) 2025-05-07T20:32:11.6984837Z self = 2025-05-07T20:32:11.6985030Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6985034Z 2025-05-07T20:32:11.6985115Z @given( 2025-05-07T20:32:11.6985238Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6985350Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6985468Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6985595Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6985713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6985796Z ) 2025-05-07T20:32:11.6986062Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6986160Z def test_silu_mul_quant( 2025-05-07T20:32:11.6986238Z self, 2025-05-07T20:32:11.6986324Z T: int, 2025-05-07T20:32:11.6986403Z D: int, 2025-05-07T20:32:11.6986505Z scale_ub: Optional[float], 2025-05-07T20:32:11.6986605Z contiguous: bool, 2025-05-07T20:32:11.6986693Z compiled: bool, 2025-05-07T20:32:11.6986774Z ) -> None: 2025-05-07T20:32:11.6986878Z torch.manual_seed(2025) 2025-05-07T20:32:11.6986955Z 2025-05-07T20:32:11.6987137Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6989024Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
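[annotation] The allocator message above already names the relevant knob. For it to take effect, PYTORCH_CUDA_ALLOC_CONF must be set before the first CUDA allocation; a sketch of one way to do that (the workflow does not currently set it) is:

    import os
    # Must run before torch initializes the CUDA caching allocator; the same
    # variable can instead be exported in the job environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch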
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6989036Z 2025-05-07T20:32:11.6989165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6989170Z 2025-05-07T20:32:11.6989279Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6989510Z self=, 2025-05-07T20:32:11.6989598Z T=4096, 2025-05-07T20:32:11.6989679Z D=5120, 2025-05-07T20:32:11.6989766Z scale_ub=None, 2025-05-07T20:32:11.6989864Z contiguous=True, 2025-05-07T20:32:11.6989953Z compiled=False, 2025-05-07T20:32:11.6990030Z ) 2025-05-07T20:32:11.6990264Z self = 2025-05-07T20:32:11.6990447Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.6990535Z 2025-05-07T20:32:11.6990625Z @given( 2025-05-07T20:32:11.6990748Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6990854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6990981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6991101Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6991218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6991303Z ) 2025-05-07T20:32:11.6991557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6991668Z def test_silu_mul_quant( 2025-05-07T20:32:11.6991748Z self, 2025-05-07T20:32:11.6991828Z T: int, 2025-05-07T20:32:11.6991915Z D: int, 2025-05-07T20:32:11.6992018Z scale_ub: Optional[float], 2025-05-07T20:32:11.6992113Z contiguous: bool, 2025-05-07T20:32:11.6992291Z compiled: bool, 2025-05-07T20:32:11.6992373Z ) -> None: 2025-05-07T20:32:11.6992476Z torch.manual_seed(2025) 2025-05-07T20:32:11.6992562Z 2025-05-07T20:32:11.6992739Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6994568Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.6994574Z 2025-05-07T20:32:11.6994697Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.6994706Z 2025-05-07T20:32:11.6994813Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.6995057Z self=, 2025-05-07T20:32:11.6995137Z T=2048, 2025-05-07T20:32:11.6995224Z D=5120, 2025-05-07T20:32:11.6995310Z scale_ub=None, 2025-05-07T20:32:11.6995400Z contiguous=False, 2025-05-07T20:32:11.6995495Z compiled=False, 2025-05-07T20:32:11.6995572Z ) 2025-05-07T20:32:11.6995797Z self = 2025-05-07T20:32:11.6995985Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.6995990Z 2025-05-07T20:32:11.6996069Z @given( 2025-05-07T20:32:11.6996192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.6996303Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.6996421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.6996550Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.6996673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.6996753Z ) 2025-05-07T20:32:11.6997015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.6997114Z def test_silu_mul_quant( 2025-05-07T20:32:11.6997194Z self, 2025-05-07T20:32:11.6997281Z T: int, 2025-05-07T20:32:11.6997360Z D: int, 2025-05-07T20:32:11.6997461Z scale_ub: Optional[float], 2025-05-07T20:32:11.6997560Z contiguous: bool, 2025-05-07T20:32:11.6997649Z compiled: bool, 2025-05-07T20:32:11.6997729Z ) -> None: 2025-05-07T20:32:11.6997836Z torch.manual_seed(2025) 2025-05-07T20:32:11.6997912Z 2025-05-07T20:32:11.6998094Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.6999997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7000009Z 2025-05-07T20:32:11.7000235Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7000240Z 2025-05-07T20:32:11.7000352Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7000585Z self=, 2025-05-07T20:32:11.7000672Z T=4096, 2025-05-07T20:32:11.7000751Z D=7168, 2025-05-07T20:32:11.7000838Z scale_ub=None, 2025-05-07T20:32:11.7000933Z contiguous=True, 2025-05-07T20:32:11.7001018Z compiled=True, 2025-05-07T20:32:11.7001094Z ) 2025-05-07T20:32:11.7001328Z self = 2025-05-07T20:32:11.7001620Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.7001625Z 2025-05-07T20:32:11.7001710Z @given( 2025-05-07T20:32:11.7001834Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7001937Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7002061Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7002181Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7002299Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7002384Z ) 2025-05-07T20:32:11.7002638Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7002745Z def test_silu_mul_quant( 2025-05-07T20:32:11.7002826Z self, 2025-05-07T20:32:11.7002905Z T: int, 2025-05-07T20:32:11.7002992Z D: int, 2025-05-07T20:32:11.7003096Z scale_ub: Optional[float], 2025-05-07T20:32:11.7003195Z contiguous: bool, 2025-05-07T20:32:11.7003294Z compiled: bool, 2025-05-07T20:32:11.7003375Z ) -> None: 2025-05-07T20:32:11.7003473Z torch.manual_seed(2025) 2025-05-07T20:32:11.7003555Z 2025-05-07T20:32:11.7003730Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7005563Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7005569Z 2025-05-07T20:32:11.7005695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7005700Z 2025-05-07T20:32:11.7005809Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7006047Z self=, 2025-05-07T20:32:11.7006129Z T=2048, 2025-05-07T20:32:11.7006214Z D=5120, 2025-05-07T20:32:11.7006302Z scale_ub=1200.0, 2025-05-07T20:32:11.7006390Z contiguous=False, 2025-05-07T20:32:11.7006487Z compiled=False, 2025-05-07T20:32:11.7006563Z ) 2025-05-07T20:32:11.7006790Z self = 2025-05-07T20:32:11.7006984Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.7006988Z 2025-05-07T20:32:11.7007071Z @given( 2025-05-07T20:32:11.7007193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7007304Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7007423Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7007557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7007762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7007841Z ) 2025-05-07T20:32:11.7008104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7008203Z def test_silu_mul_quant( 2025-05-07T20:32:11.7008284Z self, 2025-05-07T20:32:11.7008371Z T: int, 2025-05-07T20:32:11.7008451Z D: int, 2025-05-07T20:32:11.7008554Z scale_ub: Optional[float], 2025-05-07T20:32:11.7008655Z contiguous: bool, 2025-05-07T20:32:11.7008745Z compiled: bool, 2025-05-07T20:32:11.7008826Z ) -> None: 2025-05-07T20:32:11.7008933Z torch.manual_seed(2025) 2025-05-07T20:32:11.7009010Z 2025-05-07T20:32:11.7009192Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7011021Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
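[annotation] Note the pattern across these examples: after the first failure, every subsequent Hypothesis example also OOMs and free memory stays pinned near 30 MiB, suggesting tensors from earlier examples are still held by the caching allocator. Hypothetical cleanup lines for the top of the test body (tearDown would not help, since Hypothesis runs all examples inside one test call):

    import gc
    import torch

    # Release references and cached blocks left over from the previous example,
    # so a single OOM does not cascade through the remaining examples.
    gc.collect()
    torch.cuda.empty_cache()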
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7011107Z 2025-05-07T20:32:11.7011237Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7011242Z 2025-05-07T20:32:11.7011356Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7011588Z self=, 2025-05-07T20:32:11.7011675Z T=4096, 2025-05-07T20:32:11.7011755Z D=7168, 2025-05-07T20:32:11.7011846Z scale_ub=1200.0, 2025-05-07T20:32:11.7011933Z contiguous=True, 2025-05-07T20:32:11.7012020Z compiled=False, 2025-05-07T20:32:11.7012110Z ) 2025-05-07T20:32:11.7012342Z self = 2025-05-07T20:32:11.7012522Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7012527Z 2025-05-07T20:32:11.7012611Z @given( 2025-05-07T20:32:11.7012733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7012835Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7012959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7013080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7013202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7013280Z ) 2025-05-07T20:32:11.7014114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7014229Z def test_silu_mul_quant( 2025-05-07T20:32:11.7014310Z self, 2025-05-07T20:32:11.7014390Z T: int, 2025-05-07T20:32:11.7014484Z D: int, 2025-05-07T20:32:11.7014587Z scale_ub: Optional[float], 2025-05-07T20:32:11.7014683Z contiguous: bool, 2025-05-07T20:32:11.7014779Z compiled: bool, 2025-05-07T20:32:11.7014861Z ) -> None: 2025-05-07T20:32:11.7014963Z torch.manual_seed(2025) 2025-05-07T20:32:11.7015048Z 2025-05-07T20:32:11.7015224Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7017063Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7017073Z 2025-05-07T20:32:11.7017443Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7017449Z 2025-05-07T20:32:11.7017567Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7017798Z self=, 2025-05-07T20:32:11.7017878Z T=16384, 2025-05-07T20:32:11.7017967Z D=7168, 2025-05-07T20:32:11.7018053Z scale_ub=None, 2025-05-07T20:32:11.7018147Z contiguous=False, 2025-05-07T20:32:11.7018240Z compiled=True, 2025-05-07T20:32:11.7018317Z ) 2025-05-07T20:32:11.7018541Z self = 2025-05-07T20:32:11.7018730Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:11.7018734Z 2025-05-07T20:32:11.7018812Z @given( 2025-05-07T20:32:11.7018938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7019039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7019280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7019410Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7019526Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7019604Z ) 2025-05-07T20:32:11.7019865Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7019962Z def test_silu_mul_quant( 2025-05-07T20:32:11.7020050Z self, 2025-05-07T20:32:11.7020130Z T: int, 2025-05-07T20:32:11.7020209Z D: int, 2025-05-07T20:32:11.7020315Z scale_ub: Optional[float], 2025-05-07T20:32:11.7020412Z contiguous: bool, 2025-05-07T20:32:11.7020502Z compiled: bool, 2025-05-07T20:32:11.7020588Z ) -> None: 2025-05-07T20:32:11.7020687Z torch.manual_seed(2025) 2025-05-07T20:32:11.7020763Z 2025-05-07T20:32:11.7020945Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7022784Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
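[annotation] The requested sizes line up exactly with the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. Checking the example above:

    # T=16384, D=7168: 16384 * (2 * 7168) * 2 bytes = 469,762,048 bytes = 448 MiB,
    # matching "Tried to allocate 448.00 MiB"; T=2048, D=5120 gives 40 MiB the same way.
    T, D = 16384, 7168
    print(T * (2 * D) * 2 / 2**20)  # -> 448.0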
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7022790Z 2025-05-07T20:32:11.7022917Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7022921Z 2025-05-07T20:32:11.7023027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7023266Z self=, 2025-05-07T20:32:11.7023346Z T=4096, 2025-05-07T20:32:11.7023425Z D=7168, 2025-05-07T20:32:11.7023515Z scale_ub=None, 2025-05-07T20:32:11.7023609Z contiguous=True, 2025-05-07T20:32:11.7023697Z compiled=False, 2025-05-07T20:32:11.7023784Z ) 2025-05-07T20:32:11.7024011Z self = 2025-05-07T20:32:11.7024188Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7024192Z 2025-05-07T20:32:11.7024282Z @given( 2025-05-07T20:32:11.7024405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7024508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7024632Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7024751Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7024872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7024950Z ) 2025-05-07T20:32:11.7025202Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7025305Z def test_silu_mul_quant( 2025-05-07T20:32:11.7025390Z self, 2025-05-07T20:32:11.7025471Z T: int, 2025-05-07T20:32:11.7025645Z D: int, 2025-05-07T20:32:11.7025747Z scale_ub: Optional[float], 2025-05-07T20:32:11.7025839Z contiguous: bool, 2025-05-07T20:32:11.7025937Z compiled: bool, 2025-05-07T20:32:11.7026017Z ) -> None: 2025-05-07T20:32:11.7026115Z torch.manual_seed(2025) 2025-05-07T20:32:11.7026197Z 2025-05-07T20:32:11.7026372Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7028210Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7028289Z 2025-05-07T20:32:11.7028412Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7028417Z 2025-05-07T20:32:11.7028530Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7028764Z self=, 2025-05-07T20:32:11.7028848Z T=16384, 2025-05-07T20:32:11.7028937Z D=7168, 2025-05-07T20:32:11.7029022Z scale_ub=None, 2025-05-07T20:32:11.7029111Z contiguous=True, 2025-05-07T20:32:11.7029205Z compiled=False, 2025-05-07T20:32:11.7029285Z ) 2025-05-07T20:32:11.7029512Z self = 2025-05-07T20:32:11.7029700Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:11.7029705Z 2025-05-07T20:32:11.7029784Z @given( 2025-05-07T20:32:11.7029914Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7030026Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7030148Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7030277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7030395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7030476Z ) 2025-05-07T20:32:11.7030739Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7030837Z def test_silu_mul_quant( 2025-05-07T20:32:11.7030924Z self, 2025-05-07T20:32:11.7031005Z T: int, 2025-05-07T20:32:11.7031086Z D: int, 2025-05-07T20:32:11.7031202Z scale_ub: Optional[float], 2025-05-07T20:32:11.7031294Z contiguous: bool, 2025-05-07T20:32:11.7031384Z compiled: bool, 2025-05-07T20:32:11.7031473Z ) -> None: 2025-05-07T20:32:11.7031571Z torch.manual_seed(2025) 2025-05-07T20:32:11.7031647Z 2025-05-07T20:32:11.7031829Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7033668Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7033675Z 2025-05-07T20:32:11.7033803Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7033807Z 2025-05-07T20:32:11.7033917Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7034153Z self=, 2025-05-07T20:32:11.7034234Z T=16384, 2025-05-07T20:32:11.7034318Z D=7168, 2025-05-07T20:32:11.7034410Z scale_ub=1200.0, 2025-05-07T20:32:11.7034607Z contiguous=True, 2025-05-07T20:32:11.7034695Z compiled=False, 2025-05-07T20:32:11.7034780Z ) 2025-05-07T20:32:11.7035005Z self = 2025-05-07T20:32:11.7035188Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7035192Z 2025-05-07T20:32:11.7035280Z @given( 2025-05-07T20:32:11.7035405Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7035508Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7035633Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7035753Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7035876Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7035954Z ) 2025-05-07T20:32:11.7036207Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7036491Z def test_silu_mul_quant( 2025-05-07T20:32:11.7036576Z self, 2025-05-07T20:32:11.7036656Z T: int, 2025-05-07T20:32:11.7036748Z D: int, 2025-05-07T20:32:11.7036849Z scale_ub: Optional[float], 2025-05-07T20:32:11.7036941Z contiguous: bool, 2025-05-07T20:32:11.7037036Z compiled: bool, 2025-05-07T20:32:11.7037117Z ) -> None: 2025-05-07T20:32:11.7037214Z torch.manual_seed(2025) 2025-05-07T20:32:11.7037297Z 2025-05-07T20:32:11.7037472Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7039305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
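[annotation] Rather than failing example by example, the fp8 path could be skipped up front on GPUs below SM 8.9. A hedged sketch, assuming the suite is unittest-based as the tracebacks suggest (class name here is hypothetical):

    import unittest
    import torch

    @unittest.skipIf(
        not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9),
        "fp8e4nv requires SM 8.9+; this GPU only supports fp8e4b15/fp8e5",
    )
    class Fp8ActivationTests(unittest.TestCase):  # hypothetical name
        ...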
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7039316Z 2025-05-07T20:32:11.7039437Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7039442Z 2025-05-07T20:32:11.7039553Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7039785Z self=, 2025-05-07T20:32:11.7039865Z T=128, 2025-05-07T20:32:11.7039952Z D=5120, 2025-05-07T20:32:11.7040038Z scale_ub=1200.0, 2025-05-07T20:32:11.7040214Z contiguous=False, 2025-05-07T20:32:11.7040307Z compiled=False, 2025-05-07T20:32:11.7040381Z ) 2025-05-07T20:32:11.7040605Z self = 2025-05-07T20:32:11.7040786Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:11.7040791Z 2025-05-07T20:32:11.7040877Z @given( 2025-05-07T20:32:11.7041007Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7041106Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7041223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7041347Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7041463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7041544Z ) 2025-05-07T20:32:11.7041802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7041898Z def test_silu_mul_quant( 2025-05-07T20:32:11.7041982Z self, 2025-05-07T20:32:11.7042060Z T: int, 2025-05-07T20:32:11.7042136Z D: int, 2025-05-07T20:32:11.7042241Z scale_ub: Optional[float], 2025-05-07T20:32:11.7042330Z contiguous: bool, 2025-05-07T20:32:11.7042416Z compiled: bool, 2025-05-07T20:32:11.7042501Z ) -> None: 2025-05-07T20:32:11.7042598Z torch.manual_seed(2025) 2025-05-07T20:32:11.7042677Z 2025-05-07T20:32:11.7042945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7043021Z 2025-05-07T20:32:11.7043115Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7043251Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7043343Z x = x_sign * x_clamp 2025-05-07T20:32:11.7043425Z x0 = x[:, :D] 2025-05-07T20:32:11.7043513Z x1 = x[:, D:] 2025-05-07T20:32:11.7043586Z 2025-05-07T20:32:11.7043683Z if contiguous: 2025-05-07T20:32:11.7043776Z x0 = x0.contiguous() 2025-05-07T20:32:11.7043867Z x1 = x1.contiguous() 2025-05-07T20:32:11.7043949Z 2025-05-07T20:32:11.7044043Z if scale_ub is not None: 2025-05-07T20:32:11.7044150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.7044297Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.7044374Z ) 2025-05-07T20:32:11.7044535Z else: 2025-05-07T20:32:11.7044640Z scale_ub_tensor = None 2025-05-07T20:32:11.7044721Z 2025-05-07T20:32:11.7044853Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.7044954Z op = silu_mul_quant 2025-05-07T20:32:11.7045041Z if compiled: 2025-05-07T20:32:11.7045148Z op = torch.compile(op) 2025-05-07T20:32:11.7045258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7045332Z 2025-05-07T20:32:11.7045430Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.7045434Z 2025-05-07T20:32:11.7045534Z moe/activation_test.py:117: 2025-05-07T20:32:11.7045667Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7045777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.7045881Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7046396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.7046507Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.7046886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.7047125Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.7047477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.7047576Z kernel = self.compile( 2025-05-07T20:32:11.7047978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.7048158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.7048295Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7048300Z 2025-05-07T20:32:11.7048510Z self = 2025-05-07T20:32:11.7049320Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.7049850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef0f39bc0>} 2025-05-07T20:32:11.7050619Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.7050821Z context = 2025-05-07T20:32:11.7050825Z 2025-05-07T20:32:11.7050999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.7051270Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.7051474Z module_map=module_map) 2025-05-07T20:32:11.7051643Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.7051753Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.7051835Z E ^ 2025-05-07T20:32:11.7052201Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.7052206Z 2025-05-07T20:32:11.7052639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.7052643Z 2025-05-07T20:32:11.7052750Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7052984Z self=, 2025-05-07T20:32:11.7053064Z T=2048, 2025-05-07T20:32:11.7053143Z D=7168, 2025-05-07T20:32:11.7053233Z scale_ub=None, 2025-05-07T20:32:11.7053323Z contiguous=False, 2025-05-07T20:32:11.7053492Z compiled=False, 2025-05-07T20:32:11.7053574Z ) 2025-05-07T20:32:11.7053803Z self = 2025-05-07T20:32:11.7053985Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:11.7053989Z 2025-05-07T20:32:11.7054072Z @given( 2025-05-07T20:32:11.7054193Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7054302Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7054419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7054538Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7054660Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7054736Z ) 2025-05-07T20:32:11.7054990Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7055092Z def test_silu_mul_quant( 2025-05-07T20:32:11.7055171Z self, 2025-05-07T20:32:11.7055256Z T: int, 2025-05-07T20:32:11.7055340Z D: int, 2025-05-07T20:32:11.7055447Z scale_ub: Optional[float], 2025-05-07T20:32:11.7055538Z contiguous: bool, 2025-05-07T20:32:11.7055631Z compiled: bool, 2025-05-07T20:32:11.7055711Z ) -> None: 2025-05-07T20:32:11.7055814Z torch.manual_seed(2025) 2025-05-07T20:32:11.7055890Z 2025-05-07T20:32:11.7056064Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7057910Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7057921Z 2025-05-07T20:32:11.7058045Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7058050Z 2025-05-07T20:32:11.7058162Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7058394Z self=, 2025-05-07T20:32:11.7058473Z T=128, 2025-05-07T20:32:11.7058558Z D=7168, 2025-05-07T20:32:11.7058657Z scale_ub=1200.0, 2025-05-07T20:32:11.7058755Z contiguous=True, 2025-05-07T20:32:11.7058862Z compiled=True, 2025-05-07T20:32:11.7058948Z ) 2025-05-07T20:32:11.7059179Z self = 2025-05-07T20:32:11.7059351Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.7059355Z 2025-05-07T20:32:11.7059434Z @given( 2025-05-07T20:32:11.7059560Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7059669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7059866Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7059994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7060112Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7060190Z ) 2025-05-07T20:32:11.7060450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7060547Z def test_silu_mul_quant( 2025-05-07T20:32:11.7060635Z self, 2025-05-07T20:32:11.7060713Z T: int, 2025-05-07T20:32:11.7060793Z D: int, 2025-05-07T20:32:11.7060899Z scale_ub: Optional[float], 2025-05-07T20:32:11.7060990Z contiguous: bool, 2025-05-07T20:32:11.7061080Z compiled: bool, 2025-05-07T20:32:11.7061166Z ) -> None: 2025-05-07T20:32:11.7061266Z torch.manual_seed(2025) 2025-05-07T20:32:11.7061343Z 2025-05-07T20:32:11.7061520Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7061672Z 2025-05-07T20:32:11.7061773Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7061907Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7061998Z x = x_sign * x_clamp 2025-05-07T20:32:11.7062092Z x0 = x[:, :D] 2025-05-07T20:32:11.7062177Z x1 = x[:, D:] 2025-05-07T20:32:11.7062253Z 2025-05-07T20:32:11.7062347Z if contiguous: 2025-05-07T20:32:11.7062441Z x0 = x0.contiguous() 2025-05-07T20:32:11.7062536Z x1 = x1.contiguous() 2025-05-07T20:32:11.7062620Z 2025-05-07T20:32:11.7062712Z if scale_ub is not None: 2025-05-07T20:32:11.7062821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:11.7062974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:11.7063054Z ) 2025-05-07T20:32:11.7063134Z else: 2025-05-07T20:32:11.7063237Z scale_ub_tensor = None 2025-05-07T20:32:11.7063314Z 2025-05-07T20:32:11.7063454Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:11.7063559Z op = silu_mul_quant 2025-05-07T20:32:11.7063646Z if compiled: 2025-05-07T20:32:11.7063753Z op = torch.compile(op) 2025-05-07T20:32:11.7063862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7063938Z 2025-05-07T20:32:11.7064036Z > y_fp8, y_scale = fn() 2025-05-07T20:32:11.7064040Z 2025-05-07T20:32:11.7064141Z moe/activation_test.py:117: 2025-05-07T20:32:11.7064271Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7064380Z moe/activation_test.py:115: in fn 2025-05-07T20:32:11.7064484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:11.7064864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:11.7064966Z return fn(*args, **kwargs) 
2025-05-07T20:32:11.7065474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:11.7065591Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:11.7065960Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:11.7066190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:11.7066550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:11.7066647Z kernel = self.compile( 2025-05-07T20:32:11.7067050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:11.7067232Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:11.7067369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:11.7067373Z 2025-05-07T20:32:11.7067584Z self = 2025-05-07T20:32:11.7068501Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:11.7069085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f1ef0e2c2c0>} 2025-05-07T20:32:11.7069854Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:11.7070062Z context = 2025-05-07T20:32:11.7070067Z 2025-05-07T20:32:11.7070240Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:11.7070598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:11.7070712Z module_map=module_map) 2025-05-07T20:32:11.7070879Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:11.7070994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:11.7071074Z E ^ 2025-05-07T20:32:11.7071440Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:11.7071444Z 2025-05-07T20:32:11.7071882Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:11.7071887Z 2025-05-07T20:32:11.7071994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7072228Z self=, 2025-05-07T20:32:11.7072309Z T=128, 2025-05-07T20:32:11.7072388Z D=7168, 2025-05-07T20:32:11.7072487Z scale_ub=1200.0, 2025-05-07T20:32:11.7072576Z contiguous=True, 2025-05-07T20:32:11.7072665Z compiled=False, 2025-05-07T20:32:11.7072746Z ) 2025-05-07T20:32:11.7072971Z self = 2025-05-07T20:32:11.7073149Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:11.7073160Z 2025-05-07T20:32:11.7073238Z @given( 2025-05-07T20:32:11.7073360Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7073467Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7073583Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7073703Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7073824Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7073901Z ) 2025-05-07T20:32:11.7074157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7074259Z def test_silu_mul_quant( 2025-05-07T20:32:11.7074343Z self, 2025-05-07T20:32:11.7074422Z T: int, 2025-05-07T20:32:11.7074514Z D: int, 2025-05-07T20:32:11.7074614Z scale_ub: Optional[float], 2025-05-07T20:32:11.7074712Z contiguous: bool, 2025-05-07T20:32:11.7074800Z compiled: bool, 2025-05-07T20:32:11.7074880Z ) -> None: 2025-05-07T20:32:11.7074984Z torch.manual_seed(2025) 2025-05-07T20:32:11.7075060Z 2025-05-07T20:32:11.7075232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7075315Z 2025-05-07T20:32:11.7075409Z x_sign = torch.sign(x) 2025-05-07T20:32:11.7075537Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:11.7077467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7077478Z 2025-05-07T20:32:11.7077601Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:11.7077606Z 2025-05-07T20:32:11.7077718Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7077949Z self=, 2025-05-07T20:32:11.7078034Z T=128, 2025-05-07T20:32:11.7078114Z D=5120, 2025-05-07T20:32:11.7078198Z scale_ub=1200.0, 2025-05-07T20:32:11.7078290Z contiguous=True, 2025-05-07T20:32:11.7078377Z compiled=True, 2025-05-07T20:32:11.7078452Z ) 2025-05-07T20:32:11.7078681Z self = 2025-05-07T20:32:11.7078855Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:11.7078937Z 2025-05-07T20:32:11.7079021Z @given( 2025-05-07T20:32:11.7079149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7079266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7079383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7079503Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7079628Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7079705Z ) 2025-05-07T20:32:11.7079957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7080063Z def test_silu_mul_quant( 2025-05-07T20:32:11.7080196Z self, 2025-05-07T20:32:11.7080275Z T: int, 2025-05-07T20:32:11.7080364Z D: int, 2025-05-07T20:32:11.7080464Z scale_ub: Optional[float], 2025-05-07T20:32:11.7080561Z contiguous: bool, 2025-05-07T20:32:11.7080650Z compiled: bool, 2025-05-07T20:32:11.7080738Z ) -> None: 2025-05-07T20:32:11.7080847Z torch.manual_seed(2025) 2025-05-07T20:32:11.7080927Z 2025-05-07T20:32:11.7081100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7081183Z 2025-05-07T20:32:11.7081278Z > x_sign = torch.sign(x) 2025-05-07T20:32:11.7083103Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7083109Z 2025-05-07T20:32:11.7083229Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:11.7083238Z 2025-05-07T20:32:11.7083349Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:11.7083585Z self=, 2025-05-07T20:32:11.7083666Z T=128, 2025-05-07T20:32:11.7083751Z D=7168, 2025-05-07T20:32:11.7083835Z scale_ub=None, 2025-05-07T20:32:11.7083923Z contiguous=True, 2025-05-07T20:32:11.7084014Z compiled=True, 2025-05-07T20:32:11.7084091Z ) 2025-05-07T20:32:11.7084317Z self = 2025-05-07T20:32:11.7084493Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:11.7084497Z 2025-05-07T20:32:11.7084574Z @given( 2025-05-07T20:32:11.7084695Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:11.7084801Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:11.7084920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:11.7085051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:11.7085250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:11.7085328Z ) 2025-05-07T20:32:11.7085589Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:11.7085687Z def test_silu_mul_quant( 2025-05-07T20:32:11.7085852Z self, 2025-05-07T20:32:11.7085969Z T: int, 2025-05-07T20:32:11.7093994Z D: int, 2025-05-07T20:32:11.7094130Z scale_ub: Optional[float], 2025-05-07T20:32:11.7094227Z contiguous: bool, 2025-05-07T20:32:11.7094318Z compiled: bool, 2025-05-07T20:32:11.7094407Z ) -> None: 2025-05-07T20:32:11.7094507Z torch.manual_seed(2025) 2025-05-07T20:32:11.7094584Z 2025-05-07T20:32:11.7094780Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:11.7096638Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:11.7096797Z 2025-05-07T20:32:11.7096934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:11.7097079Z =============================== warnings summary =============================== 2025-05-07T20:32:11.7097408Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7097732Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7098045Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:11.7098973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:11.7099216Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:11.7099221Z 2025-05-07T20:32:11.7099415Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:11.7100728Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:11.7100929Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:11.7100938Z 2025-05-07T20:32:11.7101166Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:11.7101337Z ================== 1 failed, 1 passed, 13 warnings in 18.83s =================== 2025-05-07T20:32:13.5047905Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:13.5673379Z 2025-05-07T20:32:13.5673852Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:13.5674372Z 2025-05-07T20:32:13.5674378Z 2025-05-07T20:32:13.5696193Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:15.7274146Z ============================= test session starts ============================== 2025-05-07T20:32:15.7275655Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.7276641Z cachedir: .pytest_cache 2025-05-07T20:32:15.7277621Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.7278926Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.7279688Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.2756923Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:17.3721923Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:17.3722345Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:17.3722574Z 2025-05-07T20:32:19.2153885Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:19.2156743Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:19.2159599Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:19.2161778Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:19.2162831Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:19.2164219Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:19.2165705Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.2167105Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:19.2168575Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.2169695Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:19.2171047Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:19.2172377Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:19.2173279Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.2174565Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:19.2175860Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:19.2177133Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:19.2178228Z W0507 
20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:19.2179532Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:19.2180905Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:19.2181879Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:19.2183126Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:19.2184241Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:19.2185066Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:19.2186327Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:19.2187767Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:19.2188914Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.2189896Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.2190702Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:19.2191853Z W0507 20:32:19.212000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6215512Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6216274Z self=, 2025-05-07T20:32:19.6216706Z T=1, 2025-05-07T20:32:19.6216921Z D=5120, 2025-05-07T20:32:19.6217122Z scale_ub=None, 2025-05-07T20:32:19.6217350Z contiguous=True, 2025-05-07T20:32:19.6217587Z compiled=True, 2025-05-07T20:32:19.6217803Z ) 2025-05-07T20:32:19.6218149Z self = 2025-05-07T20:32:19.6219072Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:19.6219350Z 2025-05-07T20:32:19.6219444Z @given( 2025-05-07T20:32:19.6219690Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6220031Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6220358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6220700Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6221049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6221354Z ) 2025-05-07T20:32:19.6221720Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6222190Z def test_silu_mul_quant( 2025-05-07T20:32:19.6222450Z self, 2025-05-07T20:32:19.6222654Z T: int, 2025-05-07T20:32:19.6222867Z D: int, 2025-05-07T20:32:19.6223101Z scale_ub: Optional[float], 2025-05-07T20:32:19.6223390Z contiguous: bool, 2025-05-07T20:32:19.6223650Z compiled: bool, 2025-05-07T20:32:19.6223902Z ) -> None: 2025-05-07T20:32:19.6224125Z torch.manual_seed(2025) 2025-05-07T20:32:19.6224383Z 2025-05-07T20:32:19.6224673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6225037Z 2025-05-07T20:32:19.6225238Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6225548Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6225878Z x = x_sign * x_clamp 2025-05-07T20:32:19.6226127Z x0 = x[:, :D] 2025-05-07T20:32:19.6226356Z x1 = x[:, D:] 2025-05-07T20:32:19.6226573Z 2025-05-07T20:32:19.6226763Z if contiguous: 2025-05-07T20:32:19.6227008Z x0 = x0.contiguous() 2025-05-07T20:32:19.6227280Z x1 = x1.contiguous() 2025-05-07T20:32:19.6227530Z 2025-05-07T20:32:19.6227737Z if scale_ub is not None: 2025-05-07T20:32:19.6228038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6228391Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6228718Z ) 2025-05-07T20:32:19.6228924Z else: 2025-05-07T20:32:19.6229140Z scale_ub_tensor = None 2025-05-07T20:32:19.6229407Z 2025-05-07T20:32:19.6229651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6229976Z op = silu_mul_quant 2025-05-07T20:32:19.6230243Z if compiled: 2025-05-07T20:32:19.6230504Z op = torch.compile(op) 2025-05-07T20:32:19.6230828Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6231155Z 2025-05-07T20:32:19.6231358Z y_fp8, y_scale = fn() 2025-05-07T20:32:19.6231657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:19.6231958Z 2025-05-07T20:32:19.6232210Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6232562Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:19.6232869Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:19.6233362Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:19.6233743Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.6234063Z 2025-05-07T20:32:19.6234274Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:19.6234484Z 2025-05-07T20:32:19.6234591Z moe/activation_test.py:126: 2025-05-07T20:32:19.6234902Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6235250Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:19.6235595Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:19.6236424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:19.6237200Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:19.6237776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6238584Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6239305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:19.6240056Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:19.6240948Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:19.6241619Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:19.6242253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:19.6242791Z fn() 2025-05-07T20:32:19.6243322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:19.6243937Z self.fn.run( 2025-05-07T20:32:19.6244427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6244982Z kernel = self.compile( 2025-05-07T20:32:19.6245549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6246234Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6246650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6246902Z 2025-05-07T20:32:19.6247117Z self = 2025-05-07T20:32:19.6248242Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6249697Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3294bd36a0>} 2025-05-07T20:32:19.6251085Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6252144Z context = 2025-05-07T20:32:19.6252451Z 2025-05-07T20:32:19.6252625Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6253171Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6253853Z module_map=module_map) 2025-05-07T20:32:19.6254241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6254618Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:19.6254907Z E ^ 2025-05-07T20:32:19.6255488Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6255968Z 2025-05-07T20:32:19.6256406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:19.6256941Z 2025-05-07T20:32:19.6257062Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:19.6257497Z self=, 2025-05-07T20:32:19.6257926Z T=2048, 2025-05-07T20:32:19.6258124Z D=5120, 2025-05-07T20:32:19.6258321Z scale_ub=1200.0, 2025-05-07T20:32:19.6258557Z contiguous=True, 2025-05-07T20:32:19.6258792Z compiled=False, 2025-05-07T20:32:19.6259012Z ) 2025-05-07T20:32:19.6259344Z self = 2025-05-07T20:32:19.6259865Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:19.6260238Z 2025-05-07T20:32:19.6260325Z @given( 2025-05-07T20:32:19.6260567Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:19.6260932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:19.6261284Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:19.6261628Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:19.6261977Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:19.6262281Z ) 2025-05-07T20:32:19.6262646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:19.6263109Z def test_silu_mul_quant( 2025-05-07T20:32:19.6263366Z self, 2025-05-07T20:32:19.6263574Z T: int, 2025-05-07T20:32:19.6263778Z D: int, 2025-05-07T20:32:19.6264013Z scale_ub: Optional[float], 2025-05-07T20:32:19.6264299Z contiguous: bool, 2025-05-07T20:32:19.6264560Z compiled: bool, 2025-05-07T20:32:19.6264887Z ) -> None: 2025-05-07T20:32:19.6265153Z torch.manual_seed(2025) 2025-05-07T20:32:19.6265408Z 2025-05-07T20:32:19.6265695Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:19.6266054Z 2025-05-07T20:32:19.6266252Z x_sign = torch.sign(x) 2025-05-07T20:32:19.6266558Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:19.6266883Z x = x_sign * x_clamp 2025-05-07T20:32:19.6267131Z x0 = x[:, :D] 2025-05-07T20:32:19.6267356Z x1 = x[:, D:] 2025-05-07T20:32:19.6267573Z 2025-05-07T20:32:19.6267764Z if contiguous: 2025-05-07T20:32:19.6268007Z x0 = x0.contiguous() 2025-05-07T20:32:19.6268278Z x1 = x1.contiguous() 2025-05-07T20:32:19.6268530Z 2025-05-07T20:32:19.6268726Z if scale_ub is not None: 2025-05-07T20:32:19.6269013Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:19.6269369Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:19.6269699Z ) 2025-05-07T20:32:19.6269905Z else: 2025-05-07T20:32:19.6270134Z scale_ub_tensor = None 2025-05-07T20:32:19.6270391Z 2025-05-07T20:32:19.6270636Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:19.6270968Z op = silu_mul_quant 2025-05-07T20:32:19.6271225Z if compiled: 2025-05-07T20:32:19.6271487Z op = torch.compile(op) 2025-05-07T20:32:19.6271803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6272086Z 2025-05-07T20:32:19.6272291Z > y_fp8, y_scale = fn() 2025-05-07T20:32:19.6272463Z 2025-05-07T20:32:19.6272573Z moe/activation_test.py:117: 2025-05-07T20:32:19.6272881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6273227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:19.6273525Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:19.6274245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:19.6275056Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:19.6275626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:19.6276339Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:19.6277032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:19.6277582Z kernel = self.compile( 2025-05-07T20:32:19.6278147Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:19.6278832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:19.6279242Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:19.6279488Z 2025-05-07T20:32:19.6279705Z self = 2025-05-07T20:32:19.6281050Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:19.6282475Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3294845f80>} 2025-05-07T20:32:19.6283870Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:19.6284929Z context = 2025-05-07T20:32:19.6285234Z 2025-05-07T20:32:19.6285411Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:19.6285970Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:19.6286464Z module_map=module_map) 2025-05-07T20:32:19.6286843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:19.6287217Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:19.6287492Z E ^ 2025-05-07T20:32:19.6287977Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:19.6288454Z 2025-05-07T20:32:19.6288888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.0199168Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.0200788Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Traceback (most recent call last): 2025-05-07T20:32:20.0202287Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.0203804Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.0204833Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.0206212Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.0207996Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.0209386Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.0210838Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.0211985Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] module_map=module_map) 2025-05-07T20:32:20.0213650Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.0215147Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] generator.visit(fn.parse()) 2025-05-07T20:32:20.0216034Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.0217291Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.0218550Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ret = super().visit(node) 2025-05-07T20:32:20.0219637Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.0220717Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] return visitor(node) 2025-05-07T20:32:20.0221995Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.0223327Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.0224274Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.0225418Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.0226516Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] self.visit(item) 2025-05-07T20:32:20.0227324Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.0228556Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.0229970Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.0231114Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.0232225Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.0233010Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ^ 2025-05-07T20:32:20.0234088Z W0507 20:32:20.015000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6717777Z 2025-05-07T20:32:20.6718508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.6719295Z self=, 2025-05-07T20:32:20.6719901Z T=2048, 2025-05-07T20:32:20.6720206Z D=5120, 2025-05-07T20:32:20.6720409Z scale_ub=1200.0, 2025-05-07T20:32:20.6720638Z contiguous=True, 2025-05-07T20:32:20.6720869Z compiled=True, 2025-05-07T20:32:20.6721088Z ) 2025-05-07T20:32:20.6721456Z self = 2025-05-07T20:32:20.6721994Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:20.6731024Z 2025-05-07T20:32:20.6731151Z @given( 2025-05-07T20:32:20.6731418Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.6731756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.6732075Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.6732426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.6732780Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.6733085Z ) 2025-05-07T20:32:20.6733449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.6733916Z def test_silu_mul_quant( 2025-05-07T20:32:20.6734175Z self, 2025-05-07T20:32:20.6734380Z T: int, 2025-05-07T20:32:20.6734596Z D: int, 2025-05-07T20:32:20.6734827Z scale_ub: Optional[float], 2025-05-07T20:32:20.6735110Z contiguous: bool, 2025-05-07T20:32:20.6735365Z compiled: bool, 2025-05-07T20:32:20.6735607Z ) -> None: 2025-05-07T20:32:20.6735833Z torch.manual_seed(2025) 2025-05-07T20:32:20.6736098Z 2025-05-07T20:32:20.6736719Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.6737084Z 2025-05-07T20:32:20.6737293Z x_sign = torch.sign(x) 2025-05-07T20:32:20.6737601Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.6737931Z x = x_sign * x_clamp 2025-05-07T20:32:20.6738180Z x0 = x[:, :D] 2025-05-07T20:32:20.6738412Z x1 = x[:, D:] 2025-05-07T20:32:20.6738636Z 2025-05-07T20:32:20.6738827Z if contiguous: 2025-05-07T20:32:20.6739073Z x0 = x0.contiguous() 2025-05-07T20:32:20.6739349Z x1 = x1.contiguous() 2025-05-07T20:32:20.6739597Z 2025-05-07T20:32:20.6739808Z if scale_ub is not None: 2025-05-07T20:32:20.6740102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.6740453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.6740787Z ) 2025-05-07T20:32:20.6741147Z else: 2025-05-07T20:32:20.6741366Z scale_ub_tensor = None 2025-05-07T20:32:20.6741641Z 2025-05-07T20:32:20.6741890Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6742216Z op = silu_mul_quant 2025-05-07T20:32:20.6742482Z if compiled: 2025-05-07T20:32:20.6742748Z op = torch.compile(op) 2025-05-07T20:32:20.6743063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6743348Z 2025-05-07T20:32:20.6743553Z y_fp8, y_scale = fn() 2025-05-07T20:32:20.6743858Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:20.6744159Z 2025-05-07T20:32:20.6744413Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6744770Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:20.6745075Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:20.6745409Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:20.6745801Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.6746127Z 2025-05-07T20:32:20.6746346Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:20.6746555Z 2025-05-07T20:32:20.6746674Z moe/activation_test.py:126: 2025-05-07T20:32:20.6746991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6747344Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:20.6747694Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:20.6748531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:20.6749535Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:20.6750261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.6751093Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.6751916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:20.6752666Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:20.6753435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:20.6754109Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:20.6754746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:20.6755283Z fn() 2025-05-07T20:32:20.6755814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:20.6756423Z self.fn.run( 2025-05-07T20:32:20.6756906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.6757466Z kernel = self.compile( 2025-05-07T20:32:20.6758132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.6758819Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.6759232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6759478Z 2025-05-07T20:32:20.6759695Z self = 2025-05-07T20:32:20.6760929Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.6762440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32944c9d00>} 2025-05-07T20:32:20.6763939Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.6765016Z context = 2025-05-07T20:32:20.6765330Z 2025-05-07T20:32:20.6765504Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.6766058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.6766548Z module_map=module_map) 2025-05-07T20:32:20.6766937Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.6767315Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:20.6767590Z E ^ 2025-05-07T20:32:20.6768081Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6768570Z 2025-05-07T20:32:20.6769010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.6769549Z 2025-05-07T20:32:20.6769667Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:20.6770099Z self=, 2025-05-07T20:32:20.6770525Z T=16384, 2025-05-07T20:32:20.6770730Z D=7168, 2025-05-07T20:32:20.6770929Z scale_ub=1200.0, 2025-05-07T20:32:20.6771168Z contiguous=False, 2025-05-07T20:32:20.6771410Z compiled=False, 2025-05-07T20:32:20.6771620Z ) 2025-05-07T20:32:20.6771958Z self = 2025-05-07T20:32:20.6772496Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:20.6772795Z 2025-05-07T20:32:20.6772885Z @given( 2025-05-07T20:32:20.6773124Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:20.6773459Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:20.6773796Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:20.6774144Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:20.6774498Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:20.6774803Z ) 2025-05-07T20:32:20.6775167Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:20.6775635Z def test_silu_mul_quant( 2025-05-07T20:32:20.6775895Z self, 2025-05-07T20:32:20.6776103Z T: int, 2025-05-07T20:32:20.6776307Z D: int, 2025-05-07T20:32:20.6776542Z scale_ub: Optional[float], 2025-05-07T20:32:20.6776839Z contiguous: bool, 2025-05-07T20:32:20.6777087Z compiled: bool, 2025-05-07T20:32:20.6777325Z ) -> None: 2025-05-07T20:32:20.6777557Z torch.manual_seed(2025) 2025-05-07T20:32:20.6777812Z 2025-05-07T20:32:20.6778108Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:20.6778482Z 2025-05-07T20:32:20.6778681Z x_sign = torch.sign(x) 2025-05-07T20:32:20.6779125Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:20.6779458Z x = x_sign * x_clamp 2025-05-07T20:32:20.6779707Z x0 = x[:, :D] 2025-05-07T20:32:20.6779938Z x1 = x[:, D:] 2025-05-07T20:32:20.6780160Z 2025-05-07T20:32:20.6780355Z if contiguous: 2025-05-07T20:32:20.6780603Z x0 = x0.contiguous() 2025-05-07T20:32:20.6780878Z x1 = x1.contiguous() 2025-05-07T20:32:20.6781136Z 2025-05-07T20:32:20.6781366Z if scale_ub is not None: 2025-05-07T20:32:20.6781668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:20.6782029Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:20.6782349Z ) 2025-05-07T20:32:20.6782553Z else: 2025-05-07T20:32:20.6782778Z scale_ub_tensor = None 2025-05-07T20:32:20.6783036Z 2025-05-07T20:32:20.6783365Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:20.6783705Z op = silu_mul_quant 2025-05-07T20:32:20.6783964Z if compiled: 2025-05-07T20:32:20.6784225Z op = torch.compile(op) 2025-05-07T20:32:20.6784538Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6784823Z 2025-05-07T20:32:20.6785033Z > y_fp8, y_scale = fn() 2025-05-07T20:32:20.6785208Z 2025-05-07T20:32:20.6785318Z moe/activation_test.py:117: 2025-05-07T20:32:20.6785632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6785978Z moe/activation_test.py:115: in fn 2025-05-07T20:32:20.6786281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:20.6787014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
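Every CompilationError in this rerun bottoms out in the same ValueError: Triton's fp8e4nv type (FP8 E4M3) is unavailable on this GPU, so _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row both fail before the kernel ever launches. A hedged sketch of a capability guard such tests could use, assuming E4M3 requires compute capability 8.9 or newer (supports_fp8e4nv is an illustrative helper, not an FBGEMM API):

    import torch

    def supports_fp8e4nv() -> bool:
        # Gate on compute capability: Triton rejects fp8e4nv (the dtype
        # behind torch.float8_e4m3fn) on older NVIDIA architectures.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    # Usage: @unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 unsupported")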
2025-05-07T20:32:20.6787732Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:20.6788300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:20.6789037Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:20.6789741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:20.6790296Z kernel = self.compile( 2025-05-07T20:32:20.6790873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:20.6791613Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.6792053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:20.6792293Z 2025-05-07T20:32:20.6792512Z self = 2025-05-07T20:32:20.6793646Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:20.6795091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f32944a4e00>} 2025-05-07T20:32:20.6796490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:20.6797561Z context = 2025-05-07T20:32:20.6797864Z 2025-05-07T20:32:20.6798042Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:20.6798582Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.6799060Z module_map=module_map) 2025-05-07T20:32:20.6799443Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.6799909Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.6800281Z E ^ 2025-05-07T20:32:20.6800776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.6801251Z 2025-05-07T20:32:20.6801697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:20.9046354Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9047502Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:20.9048926Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9050609Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9051631Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9052994Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9054428Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9055796Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9057228Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9058323Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:20.9059643Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9060942Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:20.9061825Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9063080Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9064339Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:20.9065422Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9066486Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return visitor(node) 2025-05-07T20:32:20.9067866Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:20.9069198Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:20.9070137Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9071269Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:20.9072399Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] self.visit(item) 2025-05-07T20:32:20.9073289Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:20.9074507Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:20.9075910Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:20.9077012Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:20.9077954Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:20.9078733Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ^ 2025-05-07T20:32:20.9079812Z W0507 20:32:20.900000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:20.9595325Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:20.9597506Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Traceback (most recent call last): 2025-05-07T20:32:20.9600361Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:20.9602235Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:20.9603260Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:20.9604619Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:20.9606053Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:20.9607416Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:20.9608996Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:20.9610098Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] module_map=module_map) 2025-05-07T20:32:20.9611416Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:20.9612711Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] generator.visit(fn.parse()) 2025-05-07T20:32:20.9613756Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:20.9615136Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:20.9616397Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ret = super().visit(node) 2025-05-07T20:32:20.9617479Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:20.9618546Z W0507 20:32:20.955000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] return 
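Every failure in this stretch of the log has the same root cause: Triton refuses to lower fp8e4nv (its name for the FP8 E4M3 format) because the runner's GPU does not implement it, and only fp8e4b15 and fp8e5 are available. In Triton, fp8e4nv generally requires CUDA compute capability (8, 9) (Ada) or newer, so the failure can be predicted by probing the device up front. A minimal sketch of such a probe, assuming only PyTorch with CUDA support; the helper name is illustrative, not an FBGEMM or Triton API:

import torch

# Illustrative helper (not part of FBGEMM or Triton): report whether the
# current CUDA device can compile Triton's fp8e4nv (FP8 E4M3) kernels.
# E4M3 support generally starts at compute capability (8, 9), i.e. Ada.
def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv supported:", supports_fp8e4nv())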
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328ecd3ec0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
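The reference path that raises here, triton_quantize_fp8_row, performs row-wise FP8 quantization: each row is scaled by its absolute maximum (optionally capped by scale_ub) divided by the FP8 format's largest finite value. A rough pure-PyTorch sketch under those assumptions; this is illustrative only, not FBGEMM's actual implementation (E4M3's max finite value is 448, and the exact scale_ub semantics here are a guess):

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row absolute maximum, optionally capped by scale_ub.
    row_max = y.abs().amax(dim=-1, keepdim=True).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / E4M3_MAX
    y_fp8 = (y.float() / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    # Return the quantized rows plus the per-row dequantization scale,
    # so that y is approximately y_fp8.float() * scale.
    return y_fp8, scale.squeeze(-1)

This mirrors how the test consumes the result (y_fp8.to(torch.float32) * y_scale[:, None]); it requires a PyTorch build that provides torch.float8_e4m3fn.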
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError, raised eagerly from fn() at moe/activation_test.py:117 -> silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) -> _fbgemm_silu_mul_quant; source listing and traceback otherwise identical to the example above; elided]
2025-05-07T20:32:21.6998360Z W0507 20:32:21.696000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[identical CompilationError traceback for _fbgemm_silu_mul_quant elided]
2025-05-07T20:32:21.8846675Z W0507 20:32:21.880000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[identical traceback elided]
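These W-level messages also explain why the compiled=True examples get as far as ref_fn(): when torch.compile cannot trace a user-defined Triton kernel to TTIR (here because compilation itself fails), it logs this warning, conservatively assumes every input tensor is mutated, and keeps going; the hard error then surfaces only when a kernel is launched eagerly, in the reference path. A minimal sketch of that conservative fallback pattern, paraphrased and illustrative only (the real logic lives in torch/_higher_order_ops/triton_kernel_wrap.py):

from typing import Any, Callable, Dict, List

def mutated_tensor_names(
    analyze: Callable[[Dict[str, Any]], List[str]], kwargs: Dict[str, Any]
) -> List[str]:
    # Try the precise TTIR-based analysis; on any failure fall back to
    # the safe over-approximation: report every argument as mutated.
    try:
        return analyze(kwargs)
    except Exception:
        return list(kwargs.keys())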
Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
[same CompilationError from ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[same CompilationError from fn() -> silu_mul_quant -> _fbgemm_silu_mul_quant; elided]

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[same CompilationError from ref_fn() -> triton_quantize_fp8_row -> _kernel_quantize_fp8_row; elided]
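Since every Hypothesis example fails for the same hardware reason rather than a logic bug, the usual remedy is to gate the test on device capability. A hedged sketch of such a guard; the decorator placement, helper, and message are illustrative, and FBGEMM's own suite may gate fp8 coverage differently:

import unittest

import torch

def _sm_at_least(major: int, minor: int) -> bool:
    # True when a CUDA device is present and reports at least this
    # compute capability.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (major, minor)

@unittest.skipIf(
    not _sm_at_least(8, 9),
    "Triton fp8e4nv (FP8 E4M3) needs compute capability >= (8, 9)",
)
class Fp8ActivationTest(unittest.TestCase):
    ...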
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:22.6712713Z 2025-05-07T20:32:22.6713146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:22.8955770Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:22.8956891Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last): 2025-05-07T20:32:22.8958508Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:22.8959993Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:22.8961113Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:22.8962475Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:22.8963915Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:22.8965279Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:22.8966715Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:22.8967814Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] module_map=module_map) 2025-05-07T20:32:22.8969141Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:22.8970451Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] generator.visit(fn.parse()) 2025-05-07T20:32:22.8971332Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.8972587Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:22.8973847Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ret = super().visit(node) 2025-05-07T20:32:22.8974934Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:22.8976122Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] return visitor(node) 2025-05-07T20:32:22.8977404Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:22.8978741Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:22.8979690Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:22.8980826Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:22.8981941Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] self.visit(item) 2025-05-07T20:32:22.8982854Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:22.8984079Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:22.8985486Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:22.8986596Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:22.8987544Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:22.8988334Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^ 2025-05-07T20:32:22.8989414Z W0507 20:32:22.891000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. 
[editor's note: the identify_mutated_tensors warning above, with its full traceback for _fbgemm_silu_mul_quant, is then repeated verbatim three more times for torch.compile restart attempts tagged [0/5] (W0507 20:32:22.954000, 20:32:23.446000 and 20:32:23.507000); only the timestamps differ. Each repeat ends with the same error:]
W0507 20:32:23.507000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.7816366Z 2025-05-07T20:32:23.7816654Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:23.7817120Z self=, 2025-05-07T20:32:23.7817652Z T=2048, 2025-05-07T20:32:23.7818148Z D=5120, 2025-05-07T20:32:23.7818359Z scale_ub=None, 2025-05-07T20:32:23.7818581Z contiguous=True, 2025-05-07T20:32:23.7818813Z compiled=True, 2025-05-07T20:32:23.7819029Z ) 2025-05-07T20:32:23.7819358Z self = 2025-05-07T20:32:23.7819871Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:23.7820154Z 2025-05-07T20:32:23.7820237Z @given( 2025-05-07T20:32:23.7820474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:23.7820797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:23.7821119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:23.7821462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:23.7821803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:23.7822100Z ) 2025-05-07T20:32:23.7822468Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:23.7823046Z def test_silu_mul_quant( 2025-05-07T20:32:23.7823303Z self, 2025-05-07T20:32:23.7823505Z T: int, 2025-05-07T20:32:23.7823704Z D: int, 2025-05-07T20:32:23.7823933Z scale_ub: Optional[float], 2025-05-07T20:32:23.7824218Z contiguous: bool, 2025-05-07T20:32:23.7824464Z compiled: bool, 2025-05-07T20:32:23.7824702Z ) -> None: 2025-05-07T20:32:23.7824923Z torch.manual_seed(2025) 2025-05-07T20:32:23.7825173Z 2025-05-07T20:32:23.7825457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:23.7825814Z 2025-05-07T20:32:23.7826016Z x_sign = torch.sign(x) 2025-05-07T20:32:23.7826314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:23.7826637Z x = x_sign * x_clamp 2025-05-07T20:32:23.7826890Z x0 = x[:, :D] 2025-05-07T20:32:23.7827110Z x1 = x[:, D:] 2025-05-07T20:32:23.7827332Z 2025-05-07T20:32:23.7827527Z if contiguous: 2025-05-07T20:32:23.7827766Z x0 = x0.contiguous() 2025-05-07T20:32:23.7828035Z x1 = x1.contiguous() 2025-05-07T20:32:23.7828289Z 2025-05-07T20:32:23.7828488Z if scale_ub is not None: 2025-05-07T20:32:23.7828769Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:23.7829116Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:23.7829437Z ) 2025-05-07T20:32:23.7829633Z else: 2025-05-07T20:32:23.7829854Z scale_ub_tensor = None 2025-05-07T20:32:23.7830115Z 2025-05-07T20:32:23.7830350Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.7830678Z op = silu_mul_quant 2025-05-07T20:32:23.7830939Z if compiled: 2025-05-07T20:32:23.7831192Z op = torch.compile(op) 2025-05-07T20:32:23.7831499Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:23.7831790Z 2025-05-07T20:32:23.7831988Z y_fp8, y_scale = fn() 2025-05-07T20:32:23.7832291Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:23.7832597Z 2025-05-07T20:32:23.7832841Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:23.7833202Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:23.7833511Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:23.7833840Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:23.7834219Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.7834540Z 2025-05-07T20:32:23.7840524Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:23.7840772Z 2025-05-07T20:32:23.7840894Z moe/activation_test.py:126: 2025-05-07T20:32:23.7841221Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.7841582Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:23.7841935Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:23.7842934Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:23.7843717Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:23.7844291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:23.7845011Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:23.7845734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:23.7846489Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:23.7847259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:23.7847933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:23.7848653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:23.7849192Z fn() 2025-05-07T20:32:23.7849728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:23.7850343Z self.fn.run( 2025-05-07T20:32:23.7850831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:23.7851388Z kernel = self.compile( 2025-05-07T20:32:23.7851963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:23.7852650Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:23.7853066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:23.7853313Z 2025-05-07T20:32:23.7853533Z self = 2025-05-07T20:32:23.7854672Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:23.7856103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e716700>} 2025-05-07T20:32:23.7857495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:23.7858564Z context = 2025-05-07T20:32:23.7858871Z 2025-05-07T20:32:23.7859048Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:23.7859600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:23.7860101Z module_map=module_map) 2025-05-07T20:32:23.7860487Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:23.7860867Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:23.7861145Z E ^ 2025-05-07T20:32:23.7861638Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[editor's note: this example fails at moe/activation_test.py:126 with the identical CompilationError; the re-printed test source and traceback, byte-for-byte the same as the T=2048 case above apart from T, are omitted.]
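[editor's note: the W0507 identify_mutated_tensors warnings are a side effect of the same root cause: torch.compile lowers the user-defined Triton kernel to TTIR to work out which arguments it mutates, hits the identical fp8e4nv error, and falls back to assuming every input is mutated. The primary failure is the test's own: both the op under test (_fbgemm_silu_mul_quant) and the reference path (_kernel_quantize_fp8_row inside triton_quantize_fp8_row) need an fp8e4nv cast, so the reference cannot serve as a fallback on this GPU. A rough eager sketch of what the row-wise quantization computes, assuming the usual max-scaling recipe; quantize_fp8_row_eager is hypothetical, not the fbgemm implementation.]

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_eager(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally capped at scale_ub, gives the
        # per-row dequantization scale; divide and cast to quantize.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale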
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:23.7902701Z 2025-05-07T20:32:23.7903134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:24.0200990Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.0202115Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:24.0203731Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.0205229Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.0206248Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.0207620Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.0209068Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0210579Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.0212028Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0213121Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:24.0214613Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.0215917Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:24.0216800Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0218053Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.0219303Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:24.0220383Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.0221443Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return visitor(node) 2025-05-07T20:32:24.0222719Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.0224040Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.0224983Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0226119Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.0227200Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:24.0228131Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.0229354Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.0230755Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.0231855Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0232856Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0233631Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:24.0234806Z W0507 20:32:24.016000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.0814471Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.0815582Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] Traceback (most recent call last): 2025-05-07T20:32:24.0816975Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.0818470Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.0819492Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.0820849Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.0822295Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.0823661Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.0825104Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.0826196Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] module_map=module_map) 2025-05-07T20:32:24.0827510Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.0828810Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] generator.visit(fn.parse()) 2025-05-07T20:32:24.0829694Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0831098Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.0832367Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ret = super().visit(node) 2025-05-07T20:32:24.0833444Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.0834509Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] return 
visitor(node) 2025-05-07T20:32:24.0835787Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.0837238Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.0838179Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.0839320Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.0840498Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] self.visit(item) 2025-05-07T20:32:24.0841307Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.0842540Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.0843949Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.0845053Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.0846008Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.0846790Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ^ 2025-05-07T20:32:24.0847859Z W0507 20:32:24.077000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6245389Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.6246517Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:24.6247923Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.6249414Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.6250616Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.6251988Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.6253484Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.6254849Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.6256297Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6257510Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:24.6258844Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.6260146Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:24.6261034Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6262302Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.6263575Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:24.6264660Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.6265731Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:24.6267012Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.6268351Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.6269311Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6270443Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.6271537Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:24.6272362Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.6273635Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.6275164Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.6276273Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6277230Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.6278018Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:24.6279095Z W0507 20:32:24.620000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:24.6861469Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:24.6863201Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] Traceback (most recent call last): 2025-05-07T20:32:24.6864605Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:24.6866091Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:24.6867120Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:24.6868501Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:24.6869950Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:24.6871319Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:24.6872764Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:24.6873863Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] module_map=module_map) 2025-05-07T20:32:24.6875196Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:24.6876506Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] generator.visit(fn.parse()) 2025-05-07T20:32:24.6877397Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6878660Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:24.6879924Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ret = super().visit(node) 2025-05-07T20:32:24.6881261Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:24.6882356Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] return 
visitor(node) 2025-05-07T20:32:24.6883673Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:24.6885017Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:24.6885973Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:24.6887197Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:24.6888293Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] self.visit(item) 2025-05-07T20:32:24.6889108Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:24.6890334Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:24.6891759Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:24.6892886Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:24.6893849Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:24.6894634Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ^ 2025-05-07T20:32:24.6895711Z W0507 20:32:24.682000 228046 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/7] ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:24.9899151Z 
2025-05-07T20:32:24.9899650Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:24.9900703Z     self=,
2025-05-07T20:32:24.9901643Z     T=4096,
2025-05-07T20:32:24.9902047Z     D=5120,
2025-05-07T20:32:24.9902343Z     scale_ub=None,
2025-05-07T20:32:24.9902609Z     contiguous=True,
2025-05-07T20:32:24.9902847Z     compiled=True,
2025-05-07T20:32:24.9903057Z )
2025-05-07T20:32:24.9903419Z self = 
2025-05-07T20:32:24.9903929Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:24.9904206Z 
2025-05-07T20:32:24.9904292Z     @given(
2025-05-07T20:32:24.9904526Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:24.9904853Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:24.9905173Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:24.9905516Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:24.9905863Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:24.9906163Z     )
2025-05-07T20:32:24.9906531Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:24.9906991Z     def test_silu_mul_quant(
2025-05-07T20:32:24.9907242Z         self,
2025-05-07T20:32:24.9907606Z         T: int,
2025-05-07T20:32:24.9907809Z         D: int,
2025-05-07T20:32:24.9908034Z         scale_ub: Optional[float],
2025-05-07T20:32:24.9908314Z         contiguous: bool,
2025-05-07T20:32:24.9908558Z         compiled: bool,
2025-05-07T20:32:24.9908793Z     ) -> None:
2025-05-07T20:32:24.9909016Z         torch.manual_seed(2025)
2025-05-07T20:32:24.9909262Z 
2025-05-07T20:32:24.9909543Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:24.9909895Z 
2025-05-07T20:32:24.9910091Z         x_sign = torch.sign(x)
2025-05-07T20:32:24.9910393Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:24.9910712Z         x = x_sign * x_clamp
2025-05-07T20:32:24.9910955Z         x0 = x[:, :D]
2025-05-07T20:32:24.9911179Z         x1 = x[:, D:]
2025-05-07T20:32:24.9911394Z 
2025-05-07T20:32:24.9911580Z         if contiguous:
2025-05-07T20:32:24.9911938Z             x0 = x0.contiguous()
2025-05-07T20:32:24.9912217Z             x1 = x1.contiguous()
2025-05-07T20:32:24.9912488Z 
2025-05-07T20:32:24.9912708Z         if scale_ub is not None:
2025-05-07T20:32:24.9912994Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:24.9913519Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:24.9913839Z             )
2025-05-07T20:32:24.9914037Z         else:
2025-05-07T20:32:24.9914255Z             scale_ub_tensor = None
2025-05-07T20:32:24.9914513Z 
2025-05-07T20:32:24.9914752Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.9915079Z             op = silu_mul_quant
2025-05-07T20:32:24.9915331Z             if compiled:
2025-05-07T20:32:24.9915594Z                 op = torch.compile(op)
2025-05-07T20:32:24.9915898Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:24.9916178Z 
2025-05-07T20:32:24.9916379Z         y_fp8, y_scale = fn()
2025-05-07T20:32:24.9916681Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:24.9916980Z 
2025-05-07T20:32:24.9917229Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:24.9917575Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:24.9917883Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:24.9918205Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:24.9918578Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.9918901Z 
2025-05-07T20:32:24.9919106Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.9919313Z 
2025-05-07T20:32:24.9919418Z moe/activation_test.py:126: 
2025-05-07T20:32:24.9919728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:24.9920075Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:24.9920518Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:24.9921341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:24.9922128Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:24.9922696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:24.9923418Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:24.9924137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:24.9924896Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:24.9925657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:24.9926328Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:24.9926961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:24.9927635Z     fn()
2025-05-07T20:32:24.9928166Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:24.9928767Z     self.fn.run(
2025-05-07T20:32:24.9929256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:24.9929809Z     kernel = self.compile(
2025-05-07T20:32:24.9930369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:24.9931051Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:24.9931470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:24.9931710Z 
2025-05-07T20:32:24.9931934Z self = 
2025-05-07T20:32:24.9933058Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:24.9934648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96ae700>}
2025-05-07T20:32:24.9936042Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:24.9937104Z context = 
2025-05-07T20:32:24.9937406Z 
2025-05-07T20:32:24.9937588Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:24.9938130Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:24.9938625Z                            module_map=module_map)
2025-05-07T20:32:24.9939011Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.9939379Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.9939660Z E       ^
2025-05-07T20:32:24.9940141Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.9940609Z 
2025-05-07T20:32:24.9941046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
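For reference, the contract this test relies on is that triton_quantize_fp8_row returns a row-quantized tensor plus per-row scales such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode sketch of that contract (our approximation, not FBGEMM's kernel; the scale_ub clamping follows the test's usage):

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def rowwise_quantize_fp8(y, scale_ub=None):
        # One scale per row: map each row's max |value| onto the fp8 range.
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max / FP8_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

PyTorch can generally store and cast float8_e4m3fn even on older GPUs; what fails in this log is Triton compiling the fp8e4nv type for sm_86, not the storage itself.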
2025-05-07T20:32:24.9941578Z 
2025-05-07T20:32:24.9941689Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:24.9942124Z     self=,
2025-05-07T20:32:24.9942571Z     T=16384,
2025-05-07T20:32:24.9942794Z     D=5120,
2025-05-07T20:32:24.9942994Z     scale_ub=None,
2025-05-07T20:32:24.9943213Z     contiguous=True,
2025-05-07T20:32:24.9943437Z     compiled=True,
2025-05-07T20:32:24.9943653Z )
2025-05-07T20:32:24.9943988Z self = 
2025-05-07T20:32:24.9944509Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:24.9965497Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:24.9965704Z 
2025-05-07T20:32:24.9965812Z moe/activation_test.py:126: 
2025-05-07T20:32:24.9985225Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:24.9985600Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:24.9985885Z E       ^
2025-05-07T20:32:24.9986365Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:24.9986833Z 
2025-05-07T20:32:24.9987266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.0172716Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:25.0174013Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:25.0175393Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8]    last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:25.0176438Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:25.0177586Z W0507 20:32:25.016000 228046 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:32:25.4352698Z 
2025-05-07T20:32:25.4353179Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4353684Z     self=,
2025-05-07T20:32:25.4354336Z     T=1,
2025-05-07T20:32:25.4354607Z     D=5120,
2025-05-07T20:32:25.4354885Z     scale_ub=1200.0,
2025-05-07T20:32:25.4355205Z     contiguous=True,
2025-05-07T20:32:25.4355479Z     compiled=True,
2025-05-07T20:32:25.4355701Z )
2025-05-07T20:32:25.4356029Z self = 
2025-05-07T20:32:25.4356549Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.4369030Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.4369204Z 
2025-05-07T20:32:25.4369306Z moe/activation_test.py:117: 
2025-05-07T20:32:25.4369612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.4369958Z moe/activation_test.py:115: in fn
2025-05-07T20:32:25.4370248Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:25.4370849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:25.4371425Z     return fn(*args, **kwargs)
2025-05-07T20:32:25.4372109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:25.4372818Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:25.4373378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:25.4374076Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:25.4374764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:25.4375318Z     kernel = self.compile(
2025-05-07T20:32:25.4375879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:25.4376562Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:25.4377065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:25.4377308Z 
2025-05-07T20:32:25.4377526Z self = 
2025-05-07T20:32:25.4378642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:25.4380063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a47060>}
2025-05-07T20:32:25.4381455Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:25.4382628Z context = 
2025-05-07T20:32:25.4382926Z 
2025-05-07T20:32:25.4383103Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:25.4383641Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:25.4384126Z                            module_map=module_map)
2025-05-07T20:32:25.4384503Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4384864Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.4385129Z E       ^
2025-05-07T20:32:25.4385610Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4386076Z 
2025-05-07T20:32:25.4386514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
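The recompile-limit warning above is a separate, second-order issue: each Hypothesis draw flips x0/x1 between contiguous and strided views, every stride change fails a dynamo guard, and after config.recompile_limit (8) recompiles dynamo stops compiling silu_mul_quant and falls back to eager. Two knobs that could reduce the churn, sketched under the assumption that the import path matches the traceback above (fbgemm_gpu.experimental.gen_ai.moe.activation):

    import torch
    import torch._dynamo

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the cap so each distinct input layout still gets compiled.
    torch._dynamo.config.recompile_limit = 64

    # Option 2: compile with dynamic shapes so one graph may serve the varying
    # T/D/stride combinations that Hypothesis draws.
    op = torch.compile(silu_mul_quant, dynamic=True)

Neither knob addresses the fp8e4nv compilation error itself; on this GPU the kernel fails either way.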
2025-05-07T20:32:25.4387044Z 
2025-05-07T20:32:25.4387155Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.4387588Z     self=,
2025-05-07T20:32:25.4388011Z     T=1,
2025-05-07T20:32:25.4388196Z     D=5120,
2025-05-07T20:32:25.4388394Z     scale_ub=None,
2025-05-07T20:32:25.4388617Z     contiguous=False,
2025-05-07T20:32:25.4388841Z     compiled=True,
2025-05-07T20:32:25.4389048Z )
2025-05-07T20:32:25.4389376Z self = 
2025-05-07T20:32:25.4389877Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:25.4404755Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:25.4404957Z 
2025-05-07T20:32:25.4405066Z moe/activation_test.py:126: 
2025-05-07T20:32:25.4429682Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.4430053Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:25.4430327Z E       ^
2025-05-07T20:32:25.4430814Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.4431284Z 
2025-05-07T20:32:25.4431738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5844002Z 
2025-05-07T20:32:25.5844200Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5844849Z     self=,
2025-05-07T20:32:25.5845466Z     T=1,
2025-05-07T20:32:25.5845737Z     D=5120,
2025-05-07T20:32:25.5845954Z     scale_ub=None,
2025-05-07T20:32:25.5846243Z     contiguous=True,
2025-05-07T20:32:25.5846568Z     compiled=False,
2025-05-07T20:32:25.5846862Z )
2025-05-07T20:32:25.5847261Z self = 
2025-05-07T20:32:25.5847808Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False
2025-05-07T20:32:25.5860258Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.5860432Z 
2025-05-07T20:32:25.5860549Z moe/activation_test.py:117: 
2025-05-07T20:32:25.5874649Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.5875024Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.5875291Z E       ^
2025-05-07T20:32:25.5875780Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.5876247Z 
2025-05-07T20:32:25.5876771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5877306Z 
2025-05-07T20:32:25.5877420Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5877847Z     self=,
2025-05-07T20:32:25.5878269Z     T=128,
2025-05-07T20:32:25.5878467Z     D=5120,
2025-05-07T20:32:25.5878662Z     scale_ub=None,
2025-05-07T20:32:25.5878888Z     contiguous=False,
2025-05-07T20:32:25.5879124Z     compiled=True,
2025-05-07T20:32:25.5879333Z )
2025-05-07T20:32:25.5879669Z self = 
2025-05-07T20:32:25.5880408Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:32:25.5899272Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.5899449Z 
2025-05-07T20:32:25.5899553Z moe/activation_test.py:117: 
2025-05-07T20:32:25.5915279Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.5915648Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.5915912Z E       ^
2025-05-07T20:32:25.5916398Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.5916865Z 
2025-05-07T20:32:25.5917305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.5917839Z 
2025-05-07T20:32:25.5917949Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.5918381Z     self=,
2025-05-07T20:32:25.5918802Z     T=128,
2025-05-07T20:32:25.5918997Z     D=7168,
2025-05-07T20:32:25.5919194Z     scale_ub=1200.0,
2025-05-07T20:32:25.5919431Z     contiguous=False,
2025-05-07T20:32:25.5919672Z     compiled=False,
2025-05-07T20:32:25.7462466Z )
2025-05-07T20:32:25.7462937Z self = 
2025-05-07T20:32:25.7463804Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:32:25.7476692Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7476866Z 
2025-05-07T20:32:25.7476973Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7491827Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7492203Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7492525Z E       ^
2025-05-07T20:32:25.7493018Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7493484Z 
2025-05-07T20:32:25.7493924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.7494496Z 
2025-05-07T20:32:25.7494619Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.7495057Z     self=,
2025-05-07T20:32:25.7495474Z     T=128,
2025-05-07T20:32:25.7495675Z     D=5120,
2025-05-07T20:32:25.7495880Z     scale_ub=None,
2025-05-07T20:32:25.7496105Z     contiguous=False,
2025-05-07T20:32:25.7496343Z     compiled=False,
2025-05-07T20:32:25.7496565Z )
2025-05-07T20:32:25.7496903Z self = 
2025-05-07T20:32:25.7497415Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:25.7509725Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7509903Z 
2025-05-07T20:32:25.7510008Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7524525Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7524894Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7525173Z E       ^
2025-05-07T20:32:25.7525666Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7526135Z 
2025-05-07T20:32:25.7526568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.7527107Z 
2025-05-07T20:32:25.7527219Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.7527658Z     self=,
2025-05-07T20:32:25.7528085Z     T=128,
2025-05-07T20:32:25.7528278Z     D=5120,
2025-05-07T20:32:25.7528485Z     scale_ub=1200.0,
2025-05-07T20:32:25.7528723Z     contiguous=True,
2025-05-07T20:32:25.7528955Z     compiled=False,
2025-05-07T20:32:25.7529174Z )
2025-05-07T20:32:25.7529513Z self = 
2025-05-07T20:32:25.7530028Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:32:25.7542406Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.7542585Z 
2025-05-07T20:32:25.7542690Z moe/activation_test.py:117: 
2025-05-07T20:32:25.7556846Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.7557214Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.7557483Z E       ^
2025-05-07T20:32:25.7557971Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.7558439Z 
2025-05-07T20:32:25.7558878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:25.9082748Z 
2025-05-07T20:32:25.9082992Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:25.9083459Z     self=,
2025-05-07T20:32:25.9083912Z     T=1,
2025-05-07T20:32:25.9084147Z     D=7168,
2025-05-07T20:32:25.9084353Z     scale_ub=1200.0,
2025-05-07T20:32:25.9084593Z     contiguous=True,
2025-05-07T20:32:25.9084826Z     compiled=True,
2025-05-07T20:32:25.9085039Z )
2025-05-07T20:32:25.9085380Z self = 
2025-05-07T20:32:25.9085886Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:32:25.9098219Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:25.9098399Z 
2025-05-07T20:32:25.9098504Z moe/activation_test.py:117: 
2025-05-07T20:32:25.9120672Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:25.9121043Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:25.9121310Z E       ^
2025-05-07T20:32:25.9121793Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:25.9122260Z 
2025-05-07T20:32:25.9122857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:25.9122260Z 2025-05-07T20:32:25.9122857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:25.9123511Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -- identical test body; fails again at y_fp8, y_scale = fn() (moe/activation_test.py:117) while compiling _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), with the same triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") -- /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
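The repeated ValueError comes from Triton's NVIDIA backend: fp8e4nv is Triton's name for the float8 E4M3 format, which the backend only accepts on newer GPUs (roughly compute capability 8.9, Ada/Hopper, and up), while the A10G behind a linux.g5.4xlarge runner reports sm_86 and therefore only exposes 'fp8e4b15' and 'fp8e5'. Every kernel touching the E4M3 dtype thus aborts in make_ir before any CUDA code is emitted. A minimal sketch of an up-front guard, assuming the sm_89 threshold and the helper name (neither appears in this log):

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: Triton's fp8e4nv (float8 E4M3) is assumed to
        # require compute capability >= 8.9; an A10G reports (8, 6).
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)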
2025-05-07T20:32:26.1347601Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True) -- identical test body, but here y_fp8, y_scale = fn() succeeds and the failure moves to the reference path: > y_fp8_ref, y_scale_ref = ref_fn() (moe/activation_test.py:126), where ref_fn computes y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 and calls triton_quantize_fp8_row(y, scale_ub_tensor) (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The autotuner's benchmarking pass (triton/runtime/autotuner.py:186 run -> :166 _bench -> testing.py:117 do_bench -> autotuner.py:152 kernel_call -> jit.py:623 run -> compiler.py:273 compile -> make_ir) then hits the same error on the reference kernel: E triton.compiler.errors.CompilationError: at 1:0: E def _kernel_quantize_fp8_row( E ^ E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") -- /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
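For reference, the math ref_fn checks is just SiLU(x0) * x1 followed by row-wise float8 quantization. A compact sketch under assumed scaling details (the real triton_quantize_fp8_row may handle clamping and scale_ub differently):

    import torch

    def silu_mul_quant_ref(x0: torch.Tensor, x1: torch.Tensor,
                           scale_ub: torch.Tensor | None = None):
        # SiLU(x0) * x1 in fp32, as in ref_fn above.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise scale so each row fits into float8 E4M3 (max 448);
        # the clamp floor and scale_ub handling here are assumptions.
        row_max = y.abs().amax(dim=1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = (row_max / torch.finfo(torch.float8_e4m3fn).max).clamp(min=1e-12)
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale  # dequantize via y_fp8.float() * scale[:, None]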
Each of the next eight Hypothesis examples runs the identical test body and fails the same way: y_fp8, y_scale = fn() (moe/activation_test.py:117) triggers compilation of _fbgemm_silu_mul_quant (fbgemm_gpu/experimental/gen_ai/moe/activation.py:80), which aborts in make_ir with triton.compiler.errors.CompilationError: at 1:0: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100.
2025-05-07T20:32:26.1391957Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.2812652Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:26.2857201Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.2891290Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:26.4750310Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:26.4782575Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:26.6368388Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:26.6411277Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:26.6443324Z 2025-05-07T20:32:26.6443751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
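Every Hypothesis example in this run dies at the same point: Triton rejects the fp8e4nv dtype (OCP FP8 E4M3, torch.float8_e4m3fn) while lowering _fbgemm_silu_mul_quant in make_ir, before the kernel ever launches. On NVIDIA GPUs Triton typically enables fp8e4nv only from SM 8.9 (Ada) onward; the error text listing only ('fp8e4b15', 'fp8e5') is what Triton reports on older compute capabilities, so this runner's GPU evidently predates SM 8.9. A minimal probe, assuming that SM 8.9 threshold holds for the Triton build in use (supports_fp8e4nv is a hypothetical helper, not FBGEMM or Triton API):

```python
# Minimal sketch, assuming Triton's fp8e4nv (float8_e4m3fn) lowering requires
# NVIDIA SM 8.9 or newer. supports_fp8e4nv is a hypothetical helper.
import torch

def supports_fp8e4nv() -> bool:
    """Best-effort probe: does the current CUDA device support e4m3 fp8?"""
    if not torch.cuda.is_available():
        return False
    # Ada (8, 9) and Hopper (9, 0) onward expose fp8e4nv; Ampere (8, 6) does not.
    return torch.cuda.get_device_capability() >= (8, 9)
```

A test module could then gate the whole class with unittest.skipIf(not supports_fp8e4nv(), ...) rather than letting every drawn example fail inside the Triton compiler.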
Each subsequent example repeats the identical test body and traceback (moe/activation_test.py:117 -> moe/activation_test.py:115 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 -> triton/runtime/jit.py -> triton/compiler/compiler.py:100) and fails in make_ir with the same error; only the drawn parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.5181986Z 2025-05-07T20:32:27.5182424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.5182961Z 2025-05-07T20:32:27.5183070Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.5183508Z self=, 2025-05-07T20:32:27.5183929Z T=2048, 2025-05-07T20:32:27.5184119Z D=5120, 2025-05-07T20:32:27.5184319Z scale_ub=None, 2025-05-07T20:32:27.5184545Z contiguous=False, 2025-05-07T20:32:27.5184776Z compiled=True, 2025-05-07T20:32:27.5184991Z ) 2025-05-07T20:32:27.5185326Z self = 2025-05-07T20:32:27.5185836Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:27.5186123Z 2025-05-07T20:32:27.5186204Z @given( 2025-05-07T20:32:27.5186442Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.5186773Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.5187093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.5187437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.5187782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.5188078Z ) 2025-05-07T20:32:27.5188444Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.5188927Z def test_silu_mul_quant( 2025-05-07T20:32:27.5189178Z self, 2025-05-07T20:32:27.5189373Z T: int, 2025-05-07T20:32:27.5189575Z D: int, 2025-05-07T20:32:27.5189804Z scale_ub: Optional[float], 2025-05-07T20:32:27.5190086Z contiguous: bool, 2025-05-07T20:32:27.5190336Z compiled: bool, 2025-05-07T20:32:27.5190574Z ) -> None: 2025-05-07T20:32:27.5190798Z torch.manual_seed(2025) 2025-05-07T20:32:27.5191050Z 2025-05-07T20:32:27.5191335Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.5191699Z 2025-05-07T20:32:27.5191895Z x_sign = torch.sign(x) 2025-05-07T20:32:27.5192200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.5192524Z x = x_sign * x_clamp 2025-05-07T20:32:27.5192771Z x0 = x[:, :D] 2025-05-07T20:32:27.5192997Z x1 = x[:, D:] 2025-05-07T20:32:27.5193234Z 2025-05-07T20:32:27.5193448Z if contiguous: 2025-05-07T20:32:27.5193689Z x0 = x0.contiguous() 2025-05-07T20:32:27.5193956Z x1 = x1.contiguous() 2025-05-07T20:32:27.5194202Z 2025-05-07T20:32:27.5194400Z if scale_ub is not None: 2025-05-07T20:32:27.5194690Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.5195045Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.5195371Z ) 2025-05-07T20:32:27.5195563Z else: 2025-05-07T20:32:27.5195779Z scale_ub_tensor = None 2025-05-07T20:32:27.5196038Z 2025-05-07T20:32:27.5196278Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.5196710Z op = silu_mul_quant 2025-05-07T20:32:27.5196976Z if compiled: 2025-05-07T20:32:27.5197229Z op = torch.compile(op) 2025-05-07T20:32:27.5197536Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.5197820Z 2025-05-07T20:32:27.5198014Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.5198192Z 2025-05-07T20:32:27.5198293Z moe/activation_test.py:117: 2025-05-07T20:32:27.5198600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.5198942Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.5199282Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.5199870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.5200549Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.5201231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.5209485Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.5210076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.5210801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.5211497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.5212058Z kernel = self.compile( 2025-05-07T20:32:27.5212628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.5213621Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.5214047Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.5214294Z 2025-05-07T20:32:27.5214513Z self = 2025-05-07T20:32:27.5215641Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.5217070Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1c7c0>} 2025-05-07T20:32:27.5218465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.5219524Z context = 2025-05-07T20:32:27.5219829Z 2025-05-07T20:32:27.5220007Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.5220563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.5221052Z module_map=module_map) 2025-05-07T20:32:27.5221437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.5221809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.5222082Z E ^ 2025-05-07T20:32:27.5222565Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.5223036Z 2025-05-07T20:32:27.5223467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.8794056Z 2025-05-07T20:32:27.8794311Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8794821Z self=, 2025-05-07T20:32:27.8795277Z T=2048, 2025-05-07T20:32:27.8795474Z D=5120, 2025-05-07T20:32:27.8795680Z scale_ub=1200.0, 2025-05-07T20:32:27.8795912Z contiguous=False, 2025-05-07T20:32:27.8796316Z compiled=True, 2025-05-07T20:32:27.8796535Z ) 2025-05-07T20:32:27.8796866Z self = 2025-05-07T20:32:27.8797383Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:27.8797667Z 2025-05-07T20:32:27.8797754Z @given( 2025-05-07T20:32:27.8797990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8798316Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8798637Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8799039Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8799384Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8799687Z ) 2025-05-07T20:32:27.8800057Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8800589Z def test_silu_mul_quant( 2025-05-07T20:32:27.8800912Z self, 2025-05-07T20:32:27.8801116Z T: int, 2025-05-07T20:32:27.8801324Z D: int, 2025-05-07T20:32:27.8801551Z scale_ub: Optional[float], 2025-05-07T20:32:27.8801836Z contiguous: bool, 2025-05-07T20:32:27.8802084Z compiled: bool, 2025-05-07T20:32:27.8802319Z ) -> None: 2025-05-07T20:32:27.8802547Z torch.manual_seed(2025) 2025-05-07T20:32:27.8802796Z 2025-05-07T20:32:27.8803083Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8803444Z 2025-05-07T20:32:27.8803645Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8803950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.8804284Z x = x_sign * x_clamp 2025-05-07T20:32:27.8804534Z x0 = x[:, :D] 2025-05-07T20:32:27.8804755Z x1 = x[:, D:] 2025-05-07T20:32:27.8804974Z 2025-05-07T20:32:27.8805169Z if contiguous: 2025-05-07T20:32:27.8805405Z x0 = x0.contiguous() 2025-05-07T20:32:27.8805683Z x1 = x1.contiguous() 2025-05-07T20:32:27.8805938Z 2025-05-07T20:32:27.8806130Z if scale_ub is not None: 2025-05-07T20:32:27.8806420Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.8806768Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.8807087Z ) 2025-05-07T20:32:27.8807287Z else: 2025-05-07T20:32:27.8807506Z scale_ub_tensor = None 2025-05-07T20:32:27.8807762Z 2025-05-07T20:32:27.8808004Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.8808331Z op = silu_mul_quant 2025-05-07T20:32:27.8808592Z if compiled: 2025-05-07T20:32:27.8808849Z op = torch.compile(op) 2025-05-07T20:32:27.8809163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8809443Z 2025-05-07T20:32:27.8809645Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.8809821Z 2025-05-07T20:32:27.8809923Z moe/activation_test.py:117: 2025-05-07T20:32:27.8810241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8810591Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.8810887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8811478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.8812054Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.8812740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.8813628Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.8814193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.8814898Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.8815592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.8816270Z kernel = self.compile( 2025-05-07T20:32:27.8816838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.8817519Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.8817935Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8818174Z 2025-05-07T20:32:27.8818396Z self = 2025-05-07T20:32:27.8819513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.8821004Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1d580>} 2025-05-07T20:32:27.8822463Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.8823570Z context = 2025-05-07T20:32:27.8823882Z 2025-05-07T20:32:27.8824062Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.8824606Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.8825099Z module_map=module_map) 2025-05-07T20:32:27.8825485Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.8825850Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.8826124Z E ^ 2025-05-07T20:32:27.8826618Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.8827091Z 2025-05-07T20:32:27.8827531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:27.8828062Z 2025-05-07T20:32:27.8828171Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:27.8828601Z self=, 2025-05-07T20:32:27.8829022Z T=4096, 2025-05-07T20:32:27.8829214Z D=5120, 2025-05-07T20:32:27.8829415Z scale_ub=1200.0, 2025-05-07T20:32:27.8829647Z contiguous=True, 2025-05-07T20:32:27.8829872Z compiled=True, 2025-05-07T20:32:27.8830087Z ) 2025-05-07T20:32:27.8830422Z self = 2025-05-07T20:32:27.8830937Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:27.8831217Z 2025-05-07T20:32:27.8831296Z @given( 2025-05-07T20:32:27.8831533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:27.8831865Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:27.8832183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:27.8832525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:27.8832870Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:27.8833165Z ) 2025-05-07T20:32:27.8833533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:27.8834043Z def test_silu_mul_quant( 2025-05-07T20:32:27.8834294Z self, 2025-05-07T20:32:27.8834491Z T: int, 2025-05-07T20:32:27.8834697Z D: int, 2025-05-07T20:32:27.8834937Z scale_ub: Optional[float], 2025-05-07T20:32:27.8835219Z contiguous: bool, 2025-05-07T20:32:27.8835470Z compiled: bool, 2025-05-07T20:32:27.8835708Z ) -> None: 2025-05-07T20:32:27.8835930Z torch.manual_seed(2025) 2025-05-07T20:32:27.8836183Z 2025-05-07T20:32:27.8836472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:27.8836905Z 2025-05-07T20:32:27.8837113Z x_sign = torch.sign(x) 2025-05-07T20:32:27.8837419Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:27.8837740Z x = x_sign * x_clamp 2025-05-07T20:32:27.8837994Z x0 = x[:, :D] 2025-05-07T20:32:27.8838220Z x1 = x[:, D:] 2025-05-07T20:32:27.8838435Z 2025-05-07T20:32:27.8838631Z if contiguous: 2025-05-07T20:32:27.8838874Z x0 = x0.contiguous() 2025-05-07T20:32:27.8839142Z x1 = x1.contiguous() 2025-05-07T20:32:27.8839390Z 2025-05-07T20:32:27.8839640Z if scale_ub is not None: 2025-05-07T20:32:27.8839919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:27.8840343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:27.8840664Z ) 2025-05-07T20:32:27.8840863Z else: 2025-05-07T20:32:27.8841080Z scale_ub_tensor = None 2025-05-07T20:32:27.8841391Z 2025-05-07T20:32:27.8841637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:27.8841971Z op = silu_mul_quant 2025-05-07T20:32:27.8842230Z if compiled: 2025-05-07T20:32:27.8842483Z op = torch.compile(op) 2025-05-07T20:32:27.8842793Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8843081Z 2025-05-07T20:32:27.8843277Z > y_fp8, y_scale = fn() 2025-05-07T20:32:27.8843453Z 2025-05-07T20:32:27.8843555Z moe/activation_test.py:117: 2025-05-07T20:32:27.8843861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8844209Z moe/activation_test.py:115: in fn 2025-05-07T20:32:27.8844501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:27.8845081Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:27.8845661Z return fn(*args, **kwargs) 
2025-05-07T20:32:27.8846346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:27.8847058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:27.8847617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:27.8848326Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:27.8849012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:27.8849566Z kernel = self.compile( 2025-05-07T20:32:27.8850138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:27.8850818Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:27.8851228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:27.8851475Z 2025-05-07T20:32:27.8851693Z self = 2025-05-07T20:32:27.8852819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:27.8854251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1e840>} 2025-05-07T20:32:27.8855642Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:27.8856708Z context = 2025-05-07T20:32:27.8857013Z 2025-05-07T20:32:27.8857184Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:27.8857873Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:27.8858358Z module_map=module_map) 2025-05-07T20:32:27.8858740Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:27.8859110Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:27.8859375Z E ^ 2025-05-07T20:32:27.8859860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:27.8860333Z 2025-05-07T20:32:27.8860767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0540761Z 2025-05-07T20:32:28.0541095Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0541553Z self=, 2025-05-07T20:32:28.0542049Z T=128, 2025-05-07T20:32:28.0542421Z D=5120, 2025-05-07T20:32:28.0542626Z scale_ub=1200.0, 2025-05-07T20:32:28.0542871Z contiguous=False, 2025-05-07T20:32:28.0543105Z compiled=True, 2025-05-07T20:32:28.0543321Z ) 2025-05-07T20:32:28.0543685Z self = 2025-05-07T20:32:28.0544230Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.0544513Z 2025-05-07T20:32:28.0544596Z @given( 2025-05-07T20:32:28.0544837Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0545165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0545487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0545839Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0546185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0546480Z ) 2025-05-07T20:32:28.0546847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0547319Z def test_silu_mul_quant( 2025-05-07T20:32:28.0547571Z self, 2025-05-07T20:32:28.0547770Z T: int, 2025-05-07T20:32:28.0547978Z D: int, 2025-05-07T20:32:28.0548208Z scale_ub: Optional[float], 2025-05-07T20:32:28.0548487Z contiguous: bool, 2025-05-07T20:32:28.0548744Z compiled: bool, 2025-05-07T20:32:28.0548978Z ) -> None: 2025-05-07T20:32:28.0549197Z torch.manual_seed(2025) 2025-05-07T20:32:28.0549450Z 2025-05-07T20:32:28.0549735Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0550090Z 2025-05-07T20:32:28.0550296Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0550603Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0550927Z x = x_sign * x_clamp 2025-05-07T20:32:28.0551186Z x0 = x[:, :D] 2025-05-07T20:32:28.0551415Z x1 = x[:, D:] 2025-05-07T20:32:28.0551632Z 2025-05-07T20:32:28.0551835Z if contiguous: 2025-05-07T20:32:28.0552080Z x0 = x0.contiguous() 2025-05-07T20:32:28.0552352Z x1 = x1.contiguous() 2025-05-07T20:32:28.0552607Z 2025-05-07T20:32:28.0552811Z if scale_ub is not None: 2025-05-07T20:32:28.0553095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0553444Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0553764Z ) 2025-05-07T20:32:28.0553963Z else: 2025-05-07T20:32:28.0554174Z scale_ub_tensor = None 2025-05-07T20:32:28.0554434Z 2025-05-07T20:32:28.0554672Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0554997Z op = silu_mul_quant 2025-05-07T20:32:28.0555253Z if compiled: 2025-05-07T20:32:28.0555510Z op = torch.compile(op) 2025-05-07T20:32:28.0555814Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0556101Z 2025-05-07T20:32:28.0556301Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0556475Z 2025-05-07T20:32:28.0556577Z moe/activation_test.py:117: 2025-05-07T20:32:28.0557005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0557356Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0557650Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0558225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.0558806Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.0559487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.0560349Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0560909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0561623Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0562368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0562915Z kernel = self.compile( 2025-05-07T20:32:28.0563482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0564164Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0564580Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0564818Z 2025-05-07T20:32:28.0565034Z self = 2025-05-07T20:32:28.0566161Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0567590Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1f4c0>} 2025-05-07T20:32:28.0568986Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0570041Z context = 2025-05-07T20:32:28.0570346Z 2025-05-07T20:32:28.0570523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0571070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0571565Z module_map=module_map) 2025-05-07T20:32:28.0571943Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0572312Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0572585Z E ^ 2025-05-07T20:32:28.0573073Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0573545Z 2025-05-07T20:32:28.0573978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.0574514Z 2025-05-07T20:32:28.0574623Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.0575056Z self=, 2025-05-07T20:32:28.0575470Z T=16384, 2025-05-07T20:32:28.0575674Z D=7168, 2025-05-07T20:32:28.0575876Z scale_ub=1200.0, 2025-05-07T20:32:28.0576104Z contiguous=True, 2025-05-07T20:32:28.0576336Z compiled=True, 2025-05-07T20:32:28.0576548Z ) 2025-05-07T20:32:28.0576877Z self = 2025-05-07T20:32:28.0577400Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:28.0577695Z 2025-05-07T20:32:28.0577777Z @given( 2025-05-07T20:32:28.0578103Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.0578430Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.0578749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.0579096Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.0579436Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.0579736Z ) 2025-05-07T20:32:28.0580101Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.0580558Z def test_silu_mul_quant( 2025-05-07T20:32:28.0580812Z self, 2025-05-07T20:32:28.0581062Z T: int, 2025-05-07T20:32:28.0581265Z D: int, 2025-05-07T20:32:28.0581495Z scale_ub: Optional[float], 2025-05-07T20:32:28.0581782Z contiguous: bool, 2025-05-07T20:32:28.0582035Z compiled: bool, 2025-05-07T20:32:28.0582264Z ) -> None: 2025-05-07T20:32:28.0582493Z torch.manual_seed(2025) 2025-05-07T20:32:28.0582791Z 2025-05-07T20:32:28.0583082Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.0583439Z 2025-05-07T20:32:28.0583644Z x_sign = torch.sign(x) 2025-05-07T20:32:28.0583943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.0584265Z x = x_sign * x_clamp 2025-05-07T20:32:28.0584517Z x0 = x[:, :D] 2025-05-07T20:32:28.0584736Z x1 = x[:, D:] 2025-05-07T20:32:28.0584952Z 2025-05-07T20:32:28.0585143Z if contiguous: 2025-05-07T20:32:28.0585378Z x0 = x0.contiguous() 2025-05-07T20:32:28.0585645Z x1 = x1.contiguous() 2025-05-07T20:32:28.0585899Z 2025-05-07T20:32:28.0586090Z if scale_ub is not None: 2025-05-07T20:32:28.0586377Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.0586724Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.0587044Z ) 2025-05-07T20:32:28.0587243Z else: 2025-05-07T20:32:28.0587459Z scale_ub_tensor = None 2025-05-07T20:32:28.0587723Z 2025-05-07T20:32:28.0587962Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.0588289Z op = silu_mul_quant 2025-05-07T20:32:28.0588552Z if compiled: 2025-05-07T20:32:28.0588808Z op = torch.compile(op) 2025-05-07T20:32:28.0589112Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0589396Z 2025-05-07T20:32:28.0589596Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.0589765Z 2025-05-07T20:32:28.0589867Z moe/activation_test.py:117: 2025-05-07T20:32:28.0590180Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0590533Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.0590824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.0591406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.0591989Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.0592678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.0593391Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.0593951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.0594663Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.0595351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.0595907Z kernel = self.compile( 2025-05-07T20:32:28.0596471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.0597152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.0597564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.0597808Z 2025-05-07T20:32:28.0598105Z self = 2025-05-07T20:32:28.0599225Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.0600737Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e34c20>} 2025-05-07T20:32:28.0602170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.0603238Z context = 2025-05-07T20:32:28.0603580Z 2025-05-07T20:32:28.0603759Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.0604308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.0604790Z module_map=module_map) 2025-05-07T20:32:28.0605167Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.0605529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.0605802Z E ^ 2025-05-07T20:32:28.0606287Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.0606756Z 2025-05-07T20:32:28.0607193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1756033Z 2025-05-07T20:32:28.1756472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1757141Z self=, 2025-05-07T20:32:28.1768302Z T=16384, 2025-05-07T20:32:28.1768524Z D=5120, 2025-05-07T20:32:28.1768730Z scale_ub=1200.0, 2025-05-07T20:32:28.1768957Z contiguous=True, 2025-05-07T20:32:28.1769190Z compiled=False, 2025-05-07T20:32:28.1769407Z ) 2025-05-07T20:32:28.1769738Z self = 2025-05-07T20:32:28.1770252Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.1770551Z 2025-05-07T20:32:28.1770630Z @given( 2025-05-07T20:32:28.1770869Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.1771190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.1771507Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.1771848Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.1772181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.1772476Z ) 2025-05-07T20:32:28.1772842Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.1773306Z def test_silu_mul_quant( 2025-05-07T20:32:28.1773549Z self, 2025-05-07T20:32:28.1773750Z T: int, 2025-05-07T20:32:28.1774081Z D: int, 2025-05-07T20:32:28.1774303Z scale_ub: Optional[float], 2025-05-07T20:32:28.1774577Z contiguous: bool, 2025-05-07T20:32:28.1774867Z compiled: bool, 2025-05-07T20:32:28.1775101Z ) -> None: 2025-05-07T20:32:28.1775321Z torch.manual_seed(2025) 2025-05-07T20:32:28.1775565Z 2025-05-07T20:32:28.1775844Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.1776201Z 2025-05-07T20:32:28.1776396Z x_sign = torch.sign(x) 2025-05-07T20:32:28.1776696Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.1777015Z x = x_sign * x_clamp 2025-05-07T20:32:28.1777256Z x0 = x[:, :D] 2025-05-07T20:32:28.1777490Z x1 = x[:, D:] 2025-05-07T20:32:28.1777719Z 2025-05-07T20:32:28.1778199Z if contiguous: 2025-05-07T20:32:28.1778461Z x0 = x0.contiguous() 2025-05-07T20:32:28.1778727Z x1 = x1.contiguous() 2025-05-07T20:32:28.1778969Z 2025-05-07T20:32:28.1779166Z if scale_ub is not None: 2025-05-07T20:32:28.1779445Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.1779785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.1780101Z ) 2025-05-07T20:32:28.1780303Z else: 2025-05-07T20:32:28.1780512Z scale_ub_tensor = None 2025-05-07T20:32:28.1780904Z 2025-05-07T20:32:28.1781214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.1781540Z op = silu_mul_quant 2025-05-07T20:32:28.1781791Z if compiled: 2025-05-07T20:32:28.1782047Z op = torch.compile(op) 2025-05-07T20:32:28.1782352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1782701Z 2025-05-07T20:32:28.1782901Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.1783078Z 2025-05-07T20:32:28.1783186Z moe/activation_test.py:117: 2025-05-07T20:32:28.1783513Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1783880Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.1784167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1784889Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:28.1785599Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.1786160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.1786870Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.1787558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.1788111Z kernel = self.compile( 2025-05-07T20:32:28.1788679Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.1789361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.1789774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1790017Z 2025-05-07T20:32:28.1790228Z self = 2025-05-07T20:32:28.1791350Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.1792780Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e35580>} 2025-05-07T20:32:28.1794223Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.1795278Z context = 2025-05-07T20:32:28.1795582Z 2025-05-07T20:32:28.1795754Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.1796298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.1796779Z module_map=module_map) 2025-05-07T20:32:28.1797159Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1797524Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1797791Z E ^ 2025-05-07T20:32:28.1798266Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.1798738Z 2025-05-07T20:32:28.1799317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1799853Z 2025-05-07T20:32:28.1799969Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1800528Z self=, 2025-05-07T20:32:28.1800945Z T=1, 2025-05-07T20:32:28.1801129Z D=7168, 2025-05-07T20:32:28.1801325Z scale_ub=1200.0, 2025-05-07T20:32:28.1801553Z contiguous=False, 2025-05-07T20:32:28.1801785Z compiled=False, 2025-05-07T20:32:28.1801991Z ) 2025-05-07T20:32:28.1802312Z self = 2025-05-07T20:32:28.1802865Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:28.1803140Z 2025-05-07T20:32:28.1803224Z @given( 2025-05-07T20:32:28.1803456Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.1803827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.1804151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.1804489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.1804830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.1805128Z ) 2025-05-07T20:32:28.1805488Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.1805944Z def test_silu_mul_quant( 2025-05-07T20:32:28.1806193Z self, 2025-05-07T20:32:28.1806394Z T: int, 2025-05-07T20:32:28.1806594Z D: int, 2025-05-07T20:32:28.1806815Z scale_ub: Optional[float], 2025-05-07T20:32:28.1807099Z contiguous: bool, 2025-05-07T20:32:28.1807338Z compiled: bool, 2025-05-07T20:32:28.1807569Z ) -> None: 2025-05-07T20:32:28.1807790Z torch.manual_seed(2025) 2025-05-07T20:32:28.1808034Z 2025-05-07T20:32:28.1808319Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.1808677Z 2025-05-07T20:32:28.1808879Z x_sign = torch.sign(x) 2025-05-07T20:32:28.1809180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.1809500Z x = x_sign * x_clamp 2025-05-07T20:32:28.1809742Z x0 = x[:, :D] 2025-05-07T20:32:28.1809962Z x1 = x[:, D:] 2025-05-07T20:32:28.1810180Z 2025-05-07T20:32:28.1810367Z if contiguous: 2025-05-07T20:32:28.1810605Z x0 = x0.contiguous() 2025-05-07T20:32:28.1810882Z x1 = x1.contiguous() 2025-05-07T20:32:28.1811131Z 2025-05-07T20:32:28.1811322Z if scale_ub is not None: 2025-05-07T20:32:28.1811608Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.1811959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.1812280Z ) 2025-05-07T20:32:28.1812481Z else: 2025-05-07T20:32:28.1812698Z scale_ub_tensor = None 2025-05-07T20:32:28.1812950Z 2025-05-07T20:32:28.1813195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.1813829Z op = silu_mul_quant 2025-05-07T20:32:28.1814083Z if compiled: 2025-05-07T20:32:28.1814338Z op = torch.compile(op) 2025-05-07T20:32:28.1814643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1814918Z 2025-05-07T20:32:28.1815113Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.1815286Z 2025-05-07T20:32:28.1815388Z moe/activation_test.py:117: 2025-05-07T20:32:28.1815692Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1816030Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.1816325Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.1817045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.1817752Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.1818505Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.1819226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.1819915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.1820465Z kernel = self.compile( 2025-05-07T20:32:28.1821032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.1821716Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.1822177Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.1822421Z 2025-05-07T20:32:28.1822635Z self = 2025-05-07T20:32:28.1823759Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.1825243Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e368e0>} 2025-05-07T20:32:28.1826626Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.1827674Z context = 2025-05-07T20:32:28.1827980Z 2025-05-07T20:32:28.1828154Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.1828696Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.1829181Z module_map=module_map) 2025-05-07T20:32:28.1829560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.1829933Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.1830219Z E ^ 2025-05-07T20:32:28.1830700Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.1831165Z 2025-05-07T20:32:28.1831592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.1832124Z 2025-05-07T20:32:28.1832234Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.1832676Z self=, 2025-05-07T20:32:28.1833091Z T=4096, 2025-05-07T20:32:28.1833283Z D=7168, 2025-05-07T20:32:28.1833482Z scale_ub=1200.0, 2025-05-07T20:32:28.1833712Z contiguous=False, 2025-05-07T20:32:28.1833942Z compiled=True, 2025-05-07T20:32:28.3434543Z ) 2025-05-07T20:32:28.3435931Z self = 2025-05-07T20:32:28.3437092Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.3437679Z 2025-05-07T20:32:28.3437847Z @given( 2025-05-07T20:32:28.3438344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.3439011Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.3439662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.3440482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.3441181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.3441783Z ) 2025-05-07T20:32:28.3442524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.3443464Z def test_silu_mul_quant( 2025-05-07T20:32:28.3443970Z self, 2025-05-07T20:32:28.3444196Z T: int, 2025-05-07T20:32:28.3444434Z D: int, 2025-05-07T20:32:28.3444668Z scale_ub: Optional[float], 2025-05-07T20:32:28.3444958Z contiguous: bool, 2025-05-07T20:32:28.3445603Z compiled: bool, 2025-05-07T20:32:28.3445860Z ) -> None: 2025-05-07T20:32:28.3446087Z torch.manual_seed(2025) 2025-05-07T20:32:28.3446351Z 2025-05-07T20:32:28.3446648Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.3447019Z 2025-05-07T20:32:28.3447223Z x_sign = torch.sign(x) 2025-05-07T20:32:28.3447540Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.3447874Z x = x_sign * x_clamp 2025-05-07T20:32:28.3448128Z x0 = x[:, :D] 2025-05-07T20:32:28.3448456Z x1 = x[:, D:] 2025-05-07T20:32:28.3448683Z 2025-05-07T20:32:28.3448876Z if contiguous: 2025-05-07T20:32:28.3449127Z x0 = x0.contiguous() 2025-05-07T20:32:28.3449408Z x1 = x1.contiguous() 2025-05-07T20:32:28.3449664Z 2025-05-07T20:32:28.3449873Z if scale_ub is not None: 2025-05-07T20:32:28.3450258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.3450625Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.3450964Z ) 2025-05-07T20:32:28.3451207Z else: 2025-05-07T20:32:28.3451445Z scale_ub_tensor = None 2025-05-07T20:32:28.3451732Z 2025-05-07T20:32:28.3451978Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.3452324Z op = silu_mul_quant 2025-05-07T20:32:28.3452596Z if compiled: 2025-05-07T20:32:28.3452861Z op = torch.compile(op) 2025-05-07T20:32:28.3453186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3453492Z 2025-05-07T20:32:28.3453695Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.3453882Z 2025-05-07T20:32:28.3453991Z moe/activation_test.py:117: 2025-05-07T20:32:28.3454314Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3454677Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.3454979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3455586Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.3456185Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.3456879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.3457606Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.3458179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.3458901Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.3459592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.3460154Z kernel = self.compile( 2025-05-07T20:32:28.3460728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.3461429Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.3461847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3462097Z 2025-05-07T20:32:28.3462315Z self = 2025-05-07T20:32:28.3463457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.3464913Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e37a60>} 2025-05-07T20:32:28.3466415Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.3467498Z context = 2025-05-07T20:32:28.3467814Z 2025-05-07T20:32:28.3467992Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.3468553Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.3469045Z module_map=module_map) 2025-05-07T20:32:28.3469437Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.3469818Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.3470137Z E ^ 2025-05-07T20:32:28.3470634Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.3471112Z 2025-05-07T20:32:28.3471552Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.3472134Z 2025-05-07T20:32:28.3472257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.3472695Z self=, 2025-05-07T20:32:28.3473128Z T=128, 2025-05-07T20:32:28.3473332Z D=7168, 2025-05-07T20:32:28.3473540Z scale_ub=1200.0, 2025-05-07T20:32:28.3473813Z contiguous=False, 2025-05-07T20:32:28.3474078Z compiled=True, 2025-05-07T20:32:28.3474296Z ) 2025-05-07T20:32:28.3474640Z self = 2025-05-07T20:32:28.3475169Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:28.3475458Z 2025-05-07T20:32:28.3475550Z @given( 2025-05-07T20:32:28.3475791Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.3476130Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.3476459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.3476810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.3477168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.3477480Z ) 2025-05-07T20:32:28.3477849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.3478324Z def test_silu_mul_quant( 2025-05-07T20:32:28.3478586Z self, 2025-05-07T20:32:28.3478803Z T: int, 2025-05-07T20:32:28.3479005Z D: int, 2025-05-07T20:32:28.3479246Z scale_ub: Optional[float], 2025-05-07T20:32:28.3479542Z contiguous: bool, 2025-05-07T20:32:28.3479796Z compiled: bool, 2025-05-07T20:32:28.3480042Z ) -> None: 2025-05-07T20:32:28.3480408Z torch.manual_seed(2025) 2025-05-07T20:32:28.3480660Z 2025-05-07T20:32:28.3480954Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.3481319Z 2025-05-07T20:32:28.3481519Z x_sign = torch.sign(x) 2025-05-07T20:32:28.3481834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.3482170Z x = x_sign * x_clamp 2025-05-07T20:32:28.3482420Z x0 = x[:, :D] 2025-05-07T20:32:28.3482653Z x1 = x[:, D:] 2025-05-07T20:32:28.3482880Z 2025-05-07T20:32:28.3483072Z if contiguous: 2025-05-07T20:32:28.3483320Z x0 = x0.contiguous() 2025-05-07T20:32:28.3483606Z x1 = x1.contiguous() 2025-05-07T20:32:28.3483857Z 2025-05-07T20:32:28.3484077Z if scale_ub is not None: 2025-05-07T20:32:28.3484417Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.3484777Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.3485105Z ) 2025-05-07T20:32:28.3485315Z else: 2025-05-07T20:32:28.3485545Z scale_ub_tensor = None 2025-05-07T20:32:28.3485806Z 2025-05-07T20:32:28.3486054Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.3486390Z op = silu_mul_quant 2025-05-07T20:32:28.3486653Z if compiled: 2025-05-07T20:32:28.3487012Z op = torch.compile(op) 2025-05-07T20:32:28.3487334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3487625Z 2025-05-07T20:32:28.3487836Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.3488009Z 2025-05-07T20:32:28.3488124Z moe/activation_test.py:117: 2025-05-07T20:32:28.3488434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3488790Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.3489092Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.3489684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.3490313Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.3491010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:28.3491809Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:28.3492380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:28.3493296Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:28.3494004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:28.3494626Z kernel = self.compile( 2025-05-07T20:32:28.3495193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:28.3495889Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:28.3496321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.3496566Z 2025-05-07T20:32:28.3496792Z self = 2025-05-07T20:32:28.3497930Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:28.3499372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f817cea0>} 2025-05-07T20:32:28.3500782Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:28.3501859Z context = 2025-05-07T20:32:28.3502162Z 2025-05-07T20:32:28.3502352Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:28.3502905Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:28.3503436Z module_map=module_map) 2025-05-07T20:32:28.3503858Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:28.3504229Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:28.3504510Z E ^ 2025-05-07T20:32:28.3505006Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:28.3505478Z 2025-05-07T20:32:28.3505923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:28.3506459Z 2025-05-07T20:32:28.3506572Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.3507021Z self=, 2025-05-07T20:32:28.3507452Z T=2048, 2025-05-07T20:32:28.3507663Z D=7168, 2025-05-07T20:32:28.3507866Z scale_ub=None, 2025-05-07T20:32:28.3508097Z contiguous=True, 2025-05-07T20:32:28.3508340Z compiled=True, 2025-05-07T20:32:28.4771378Z ) 2025-05-07T20:32:28.4772133Z self = 2025-05-07T20:32:28.4772677Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:28.4772961Z 2025-05-07T20:32:28.4773056Z @given( 2025-05-07T20:32:28.4773298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.4773637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.4773969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.4774317Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.4774673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.4775104Z ) 2025-05-07T20:32:28.4775483Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.4775951Z def test_silu_mul_quant( 2025-05-07T20:32:28.4776211Z self, 2025-05-07T20:32:28.4776423Z T: int, 2025-05-07T20:32:28.4776714Z D: int, 2025-05-07T20:32:28.4776951Z scale_ub: Optional[float], 2025-05-07T20:32:28.4777249Z contiguous: bool, 2025-05-07T20:32:28.4777500Z compiled: bool, 2025-05-07T20:32:28.4777744Z ) -> None: 2025-05-07T20:32:28.4777978Z torch.manual_seed(2025) 2025-05-07T20:32:28.4778232Z 2025-05-07T20:32:28.4778574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.4778928Z 2025-05-07T20:32:28.4779141Z x_sign = torch.sign(x) 2025-05-07T20:32:28.4779456Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:28.4779783Z x = x_sign * x_clamp 2025-05-07T20:32:28.4780046Z x0 = x[:, :D] 2025-05-07T20:32:28.4780300Z x1 = x[:, D:] 2025-05-07T20:32:28.4780524Z 2025-05-07T20:32:28.4780719Z if contiguous: 2025-05-07T20:32:28.4780972Z x0 = x0.contiguous() 2025-05-07T20:32:28.4781251Z x1 = x1.contiguous() 2025-05-07T20:32:28.4781503Z 2025-05-07T20:32:28.4781715Z if scale_ub is not None: 2025-05-07T20:32:28.4782012Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:28.4782367Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:28.4782702Z ) 2025-05-07T20:32:28.4782915Z else: 2025-05-07T20:32:28.4783148Z scale_ub_tensor = None 2025-05-07T20:32:28.4783417Z 2025-05-07T20:32:28.4783671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:28.4784013Z op = silu_mul_quant 2025-05-07T20:32:28.4784277Z if compiled: 2025-05-07T20:32:28.4784547Z op = torch.compile(op) 2025-05-07T20:32:28.4795012Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4795326Z 2025-05-07T20:32:28.4795533Z > y_fp8, y_scale = fn() 2025-05-07T20:32:28.4795720Z 2025-05-07T20:32:28.4795831Z moe/activation_test.py:117: 2025-05-07T20:32:28.4796158Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:28.4796519Z moe/activation_test.py:115: in fn 2025-05-07T20:32:28.4796835Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:28.4797447Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:28.4798051Z return fn(*args, **kwargs) 
2025-05-07T20:32:28.4798748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:28.4799484Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:28.4800064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:28.4800897Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:28.4801607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:28.4802176Z     kernel = self.compile(
2025-05-07T20:32:28.4802896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:28.4803590Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:28.4811603Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:28.4811986Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:32:28.4812259Z E   ^
2025-05-07T20:32:28.4812754Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.4813227Z 
2025-05-07T20:32:28.4814115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
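Note that compiled=False examples later in the run fail with this same CompilationError: silu_mul_quant launches the Triton kernel directly (activation.py:80), so the torch.compile wrapper is not what pulls in fp8e4nv. For orientation, a plain eager sketch of what the fused op appears to compute; the row-wise scaling and the 448.0 e4m3 ceiling below are assumptions, since the actual quantization scheme of _fbgemm_silu_mul_quant is not visible in this log:

    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        # Gate: silu(x0) * x1, computed in float32 for accuracy.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Assumed row-wise scale into the float8_e4m3fn range (max 448).
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / 448.0
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

    # Usage mirroring the test: y_fp8, y_scale = silu_mul_quant_ref(x0, x1, scale_ub_tensor)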
2025-05-07T20:32:28.4814660Z 
2025-05-07T20:32:28.4814775Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:28.4815216Z     self=,
2025-05-07T20:32:28.4815643Z     T=16384,
2025-05-07T20:32:28.4815849Z     D=5120,
2025-05-07T20:32:28.4816047Z     scale_ub=None,
2025-05-07T20:32:28.4816281Z     contiguous=False,
2025-05-07T20:32:28.4816524Z     compiled=False,
2025-05-07T20:32:28.4816736Z )
2025-05-07T20:32:28.4817074Z self = 
2025-05-07T20:32:28.4817606Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:28.4817901Z 
2025-05-07T20:32:28.4823214Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.4823833Z         x_sign = torch.sign(x)
2025-05-07T20:32:28.4824332Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.4826444Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-05-07T20:32:28.4828436Z 
2025-05-07T20:32:28.4828571Z moe/activation_test.py:95: OutOfMemoryError
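Every OOM report that follows carries the same allocator hint as the message above. For it to take effect, PYTORCH_CUDA_ALLOC_CONF has to be in the environment before the process makes its first CUDA allocation, so it belongs in the CI step or a conftest rather than inside the test body; a sketch, with the placement illustrative:

    import os

    # Must be set before the first CUDA allocation in the process;
    # safest is before torch is imported at all.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # noqa: E402  (imported after the env var on purpose)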
2025-05-07T20:32:28.4828902Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:28.4838263Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.4840437Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.4842508Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:28.4842838Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:28.4851440Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.4853558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.4855627Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.6078365Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:28.6088397Z >       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:28.6090498Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.6092677Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:32:28.6093016Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.6102043Z >       x_sign = torch.sign(x)
2025-05-07T20:32:28.6104058Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.6106102Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:32:28.6106439Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.6122073Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6136698Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6138749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.6139403Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.6154500Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.6168871Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.6170920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.7298575Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.7324749Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.7339431Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7341506Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.7342156Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.7350714Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.7352861Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.7354995Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.7355400Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.7370835Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:28.7385242Z E       triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:28.7387314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:28.8194581Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8204163Z >       x_sign = torch.sign(x)
2025-05-07T20:32:28.8206193Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8208252Z moe/activation_test.py:94: OutOfMemoryError
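The requested sizes in these OOM reports match the first allocation each example makes, x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16), at 2 bytes per bfloat16 element; the examples themselves are small, and the failures come from the roughly 22 GiB already held by earlier work in the process. The arithmetic, as a quick check:

    def first_alloc_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16) -> 2 bytes per element
        return T * (2 * D) * 2 / 2**20

    assert first_alloc_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
    assert first_alloc_mib(4096, 7168) == 112.0   # "Tried to allocate 112.00 MiB"
    assert first_alloc_mib(2048, 5120) == 40.0    # "Tried to allocate 40.00 MiB"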
2025-05-07T20:32:28.8208585Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8217264Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8219558Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8221629Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8221963Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8230357Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8232480Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8234582Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8234925Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:28.8243460Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8245584Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8247675Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8248011Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:28.8256292Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8258419Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8260470Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8260889Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:28.8816696Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8818815Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8831247Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8831600Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:28.8840098Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8842309Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8844510Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8844847Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:28.8853053Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8855180Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8857227Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8857562Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8865951Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8868072Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
2025-05-07T20:32:28.8870119Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:32:28.8870443Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:28.8878720Z >       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:28.8881006Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:28.8882971Z 2025-05-07T20:32:28.8883092Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:28.8883317Z 2025-05-07T20:32:28.8883423Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:28.8883854Z self=, 2025-05-07T20:32:28.8884316Z T=16384, 2025-05-07T20:32:28.8884509Z D=7168, 2025-05-07T20:32:28.8884715Z scale_ub=1200.0, 2025-05-07T20:32:28.8884945Z contiguous=True, 2025-05-07T20:32:28.8885175Z compiled=False, 2025-05-07T20:32:28.8885385Z ) 2025-05-07T20:32:28.8885718Z self = 2025-05-07T20:32:28.8886230Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:28.8886528Z 2025-05-07T20:32:28.8886607Z @given( 2025-05-07T20:32:28.8886846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:28.8887169Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:28.8887483Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:28.8887823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:28.8888166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:28.8888459Z ) 2025-05-07T20:32:28.8888831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:28.8889299Z def test_silu_mul_quant( 2025-05-07T20:32:28.8889551Z self, 2025-05-07T20:32:28.8889755Z T: int, 2025-05-07T20:32:28.8889962Z D: int, 2025-05-07T20:32:28.8890183Z scale_ub: Optional[float], 2025-05-07T20:32:28.8890469Z contiguous: bool, 2025-05-07T20:32:28.8890722Z compiled: bool, 2025-05-07T20:32:28.8890946Z ) -> None: 2025-05-07T20:32:28.8891174Z torch.manual_seed(2025) 2025-05-07T20:32:28.8891425Z 2025-05-07T20:32:28.8891699Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:28.8893877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
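The allocator hint repeated in these messages can be tried before re-running the suite. A minimal sketch, assuming only the standard PyTorch caching-allocator environment variable named in the error text (the tensor shape is illustrative):

    import os

    # Must be set before CUDA is first initialized in this process; expandable
    # segments let the caching allocator grow existing blocks instead of
    # fragmenting into many fixed-size ones.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # noqa: E402  (imported after the env var is set)

    x = torch.randn(1024, device="cuda")  # allocations now use expandable segments

Calling torch.cuda.empty_cache() between Hypothesis examples is another mitigation, though it only returns cached blocks to the driver and cannot reclaim memory still referenced by live tensors.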
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
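The CompilationError above is an architecture limitation rather than a test-logic bug: Triton rejects the fp8e4nv (FP8 E4M3) dtype on this runner's GPU and offers only fp8e4b15 and fp8e5. A minimal sketch of a capability guard one could place in front of such tests; the (8, 9) threshold (Ada/Hopper-class) is an assumption inferred from this error message, not something the log states, and the class and function names are illustrative:

    import unittest

    import torch


    def supports_fp8_e4m3() -> bool:
        # Assumption: Triton's fp8e4nv requires compute capability >= (8, 9);
        # the g5 runner's A10G reports (8, 6) and fails to compile the kernel.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8_e4m3(), "FP8 E4M3 unsupported on this GPU")
    class ActivationFP8Tests(unittest.TestCase):
        ...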
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
moe/activation_test.py:117:
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining examples in this pass failed with the same out-of-memory condition before reaching any kernel; by this point only 4.44 MiB of the 22.07 GiB on GPU 0 was free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:95)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (moe/activation_test.py:92)

See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
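For the compiled=True examples the only new frame is torch._dynamo's eval_frame wrapper: torch.compile(op) intercepts the call and then reaches the same Triton kernel, so both paths surface the identical fp8e4nv error. A hedged sketch of that wrapping, using the reference math from the test's own ref_fn; the function name silu_mul_ref is illustrative, not from the log:

    import torch


    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # Reference math from the test's ref_fn: SiLU(x0) * x1, computed in fp32.
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32


    # compiled=True in the test corresponds to wrapping the op like this; the
    # call is routed through torch._dynamo.eval_frame before execution.
    compiled_op = torch.compile(silu_mul_ref)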
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |     a,
    |     ^^
    |     ...<23 lines>...
    |     USE_INT64=use_int64,
    |     ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |     ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |         *args,
    |         **current,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
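Each falsifying example above comes with a reproduce_failure payload, and applying it is exactly what the message suggests. A minimal sketch, assuming only the public Hypothesis decorator; the blob shown is the one printed for sub-exception 1, _MAX_SAMPLES is replaced by a literal here, and the test body is elided:

    import unittest
    from typing import Optional

    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st


    class ActivationTests(unittest.TestCase):
        # Temporarily pins Hypothesis to the falsifying example from
        # sub-exception 1; remove the decorator once the failure is fixed.
        @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
        def test_silu_mul_quant(
            self,
            T: int,
            D: int,
            scale_ub: Optional[float],
            contiguous: bool,
            compiled: bool,
        ) -> None:
            ...  # body as in moe/activation_test.py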
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)

Here fn() returned successfully and the failure moved into the reference path, ref_fn:

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

The remaining examples in the log fail in the same two ways, in fn() when the FBGEMM kernel compiles eagerly and in ref_fn() when the reference quantization kernel compiles:

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       y_fp8_ref, y_scale_ref = ref_fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
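Every example reported above and below dies in the same place: Triton rejects the kernel during ast_to_ttir because it requests the fp8e4nv (torch.float8_e4m3fn) dtype, and the GPU on this g5.4xlarge runner (an NVIDIA A10G, compute capability 8.6) only offers fp8e4b15 and fp8e5. A minimal sketch of a guard that would skip the test on such GPUs instead of failing every example, assuming the usual torch.cuda.get_device_capability() check and assuming fp8e4nv needs compute capability 8.9 or newer (the helper and class names are hypothetical, not part of this test suite):

import unittest

import torch


def cuda_supports_fp8e4nv() -> bool:
    # Hypothetical helper, assuming Triton's fp8e4nv (torch.float8_e4m3fn)
    # conversions need compute capability >= 8.9 (Ada/Hopper); the A10G
    # on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


# Usage on a test like the one failing here:
@unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv requires sm_89 or newer")
class SiluMulQuantTest(unittest.TestCase):
    ...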
Hypothesis goes on to try the remaining examples, and every one fails with this same CompilationError. In this stretch of the log the failure surfaces in _fbgemm_silu_mul_quant (via fn at moe/activation_test.py:115-117) whenever compiled=False, and in _kernel_quantize_fp8_row (via ref_fn at moe/activation_test.py:124-126) whenever compiled=True:

Trying example: test_silu_mul_quant(T=1,    D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=7168, scale_ub=None,   contiguous=False, compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> fn / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,    D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128,  D=5120, scale_ub=None,   contiguous=True,  compiled=True)  -> ref_fn / _kernel_quantize_fp8_row
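The error text itself names the two fp8 formats Triton does accept on this architecture: fp8e4b15 and fp8e5, the latter being torch.float8_e5m2. Purely as an illustration of a capability-aware fallback, not something the FBGEMM kernels above do (they are written against fp8e4nv), a caller could select the quantization dtype from the device, again assuming the 8.9 threshold:

import torch


def pick_fp8_dtype() -> torch.dtype:
    # Illustrative fallback only: torch.float8_e4m3fn maps to Triton's
    # fp8e4nv (assumed to need >= sm_89), while torch.float8_e5m2 maps to
    # fp8e5, which the error message lists as supported on this GPU.
    if torch.cuda.get_device_capability() >= (8, 9):
        return torch.float8_e4m3fn
    return torch.float8_e5m2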
Each of those examples ends with the identical traceback through /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273 (compile -> src.make_ir -> ast_to_ttir), reported as:

E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4390287Z 2025-05-07T20:32:29.4390764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4390769Z 2025-05-07T20:32:29.4390951Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4391280Z self=, 2025-05-07T20:32:29.4391389Z T=4096, 2025-05-07T20:32:29.4391496Z D=5120, 2025-05-07T20:32:29.4391642Z scale_ub=None, 2025-05-07T20:32:29.4391785Z contiguous=True, 2025-05-07T20:32:29.4391946Z compiled=True, 2025-05-07T20:32:29.4392123Z ) 2025-05-07T20:32:29.4392381Z self = 2025-05-07T20:32:29.4392585Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4392631Z 2025-05-07T20:32:29.4392774Z @given( 2025-05-07T20:32:29.4392916Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4393211Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4393360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4393507Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4393687Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4393797Z ) 2025-05-07T20:32:29.4394107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4394355Z def test_silu_mul_quant( 2025-05-07T20:32:29.4394464Z self, 2025-05-07T20:32:29.4394571Z T: int, 2025-05-07T20:32:29.4394711Z D: int, 2025-05-07T20:32:29.4394840Z scale_ub: Optional[float], 2025-05-07T20:32:29.4395068Z contiguous: bool, 2025-05-07T20:32:29.4395201Z compiled: bool, 2025-05-07T20:32:29.4395311Z ) -> None: 2025-05-07T20:32:29.4395479Z torch.manual_seed(2025) 2025-05-07T20:32:29.4395586Z 2025-05-07T20:32:29.4395814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4395991Z 2025-05-07T20:32:29.4396129Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4396284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4396440Z x = x_sign * x_clamp 2025-05-07T20:32:29.4396551Z x0 = x[:, :D] 2025-05-07T20:32:29.4396684Z x1 = x[:, D:] 2025-05-07T20:32:29.4396857Z 2025-05-07T20:32:29.4397031Z if contiguous: 2025-05-07T20:32:29.4397187Z x0 = x0.contiguous() 2025-05-07T20:32:29.4397309Z x1 = x1.contiguous() 2025-05-07T20:32:29.4397434Z 2025-05-07T20:32:29.4397576Z if scale_ub is not None: 2025-05-07T20:32:29.4397764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4397952Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4398096Z ) 2025-05-07T20:32:29.4398202Z else: 2025-05-07T20:32:29.4398353Z scale_ub_tensor = None 2025-05-07T20:32:29.4398476Z 2025-05-07T20:32:29.4398692Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4398860Z op = silu_mul_quant 2025-05-07T20:32:29.4398975Z if compiled: 2025-05-07T20:32:29.4399128Z op = torch.compile(op) 2025-05-07T20:32:29.4399299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4399386Z 2025-05-07T20:32:29.4399559Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4399758Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4399865Z 2025-05-07T20:32:29.4400055Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4400330Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4400445Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4400774Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4401058Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4401163Z 2025-05-07T20:32:29.4401330Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4401335Z 2025-05-07T20:32:29.4401464Z moe/activation_test.py:126: 2025-05-07T20:32:29.4401609Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4401843Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4402030Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4402668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4402843Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4403247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4403577Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4404068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4404397Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4404813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4405015Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4405415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4405594Z fn() 2025-05-07T20:32:29.4406052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4406199Z self.fn.run( 2025-05-07T20:32:29.4406579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4406785Z kernel = self.compile( 2025-05-07T20:32:29.4407197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4407473Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4407680Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4407684Z 2025-05-07T20:32:29.4407923Z self = 2025-05-07T20:32:29.4408788Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4409339Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96ae700>} 2025-05-07T20:32:29.4410229Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4410472Z context = 2025-05-07T20:32:29.4410477Z 2025-05-07T20:32:29.4410675Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4411011Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4411148Z module_map=module_map) 2025-05-07T20:32:29.4411366Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4411564Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4411685Z E ^ 2025-05-07T20:32:29.4412112Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4412120Z 2025-05-07T20:32:29.4412655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4412660Z 2025-05-07T20:32:29.4412819Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4413099Z self=, 2025-05-07T20:32:29.4413260Z T=16384, 2025-05-07T20:32:29.4413654Z D=5120, 2025-05-07T20:32:29.4413804Z scale_ub=None, 2025-05-07T20:32:29.4413987Z contiguous=True, 2025-05-07T20:32:29.4414134Z compiled=True, 2025-05-07T20:32:29.4414223Z ) 2025-05-07T20:32:29.4414634Z self = 2025-05-07T20:32:29.4414890Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4414895Z 2025-05-07T20:32:29.4415027Z @given( 2025-05-07T20:32:29.4415213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4415409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4415547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4415814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4415981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4416085Z ) 2025-05-07T20:32:29.4416402Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4416526Z def test_silu_mul_quant( 2025-05-07T20:32:29.4416618Z self, 2025-05-07T20:32:29.4416821Z T: int, 2025-05-07T20:32:29.4416950Z D: int, 2025-05-07T20:32:29.4417111Z scale_ub: Optional[float], 2025-05-07T20:32:29.4417233Z contiguous: bool, 2025-05-07T20:32:29.4417350Z compiled: bool, 2025-05-07T20:32:29.4417524Z ) -> None: 2025-05-07T20:32:29.4417684Z torch.manual_seed(2025) 2025-05-07T20:32:29.4417786Z 2025-05-07T20:32:29.4418072Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4418179Z 2025-05-07T20:32:29.4418306Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4418529Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4418684Z x = x_sign * x_clamp 2025-05-07T20:32:29.4418830Z x0 = x[:, :D] 2025-05-07T20:32:29.4418939Z x1 = x[:, D:] 2025-05-07T20:32:29.4419041Z 2025-05-07T20:32:29.4419173Z if contiguous: 2025-05-07T20:32:29.4419340Z x0 = x0.contiguous() 2025-05-07T20:32:29.4419496Z x1 = x1.contiguous() 2025-05-07T20:32:29.4419630Z 2025-05-07T20:32:29.4419752Z if scale_ub is not None: 2025-05-07T20:32:29.4419891Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4420079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4420251Z ) 2025-05-07T20:32:29.4420374Z else: 2025-05-07T20:32:29.4420533Z scale_ub_tensor = None 2025-05-07T20:32:29.4420638Z 2025-05-07T20:32:29.4420832Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4420943Z op = silu_mul_quant 2025-05-07T20:32:29.4421126Z if compiled: 2025-05-07T20:32:29.4421343Z op = torch.compile(op) 2025-05-07T20:32:29.4421480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4421582Z 2025-05-07T20:32:29.4421736Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4421895Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4422041Z 2025-05-07T20:32:29.4422256Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4422390Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4422556Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4422734Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4422893Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4423088Z 2025-05-07T20:32:29.4423223Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:29.4423228Z 2025-05-07T20:32:29.4423488Z moe/activation_test.py:126: 2025-05-07T20:32:29.4423708Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4423845Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4424086Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4424705Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4424839Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4425291Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4425618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4426049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4427067Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4427501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4427756Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4428138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4428244Z fn() 2025-05-07T20:32:29.4428707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4428873Z self.fn.run( 2025-05-07T20:32:29.4429322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4429447Z kernel = self.compile( 2025-05-07T20:32:29.4429869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4430119Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4430263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4430268Z 2025-05-07T20:32:29.4430607Z self = 2025-05-07T20:32:29.4431459Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4432010Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f941eb60>} 2025-05-07T20:32:29.4432847Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4433074Z context = 2025-05-07T20:32:29.4433079Z 2025-05-07T20:32:29.4433347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4433687Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4433860Z module_map=module_map) 2025-05-07T20:32:29.4434054Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4434187Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4434315Z E ^ 2025-05-07T20:32:29.4434797Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4434802Z 2025-05-07T20:32:29.4435296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4435338Z 2025-05-07T20:32:29.4435554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4435816Z self=, 2025-05-07T20:32:29.4435956Z T=1, 2025-05-07T20:32:29.4436048Z D=5120, 2025-05-07T20:32:29.4436239Z scale_ub=1200.0, 2025-05-07T20:32:29.4436401Z contiguous=True, 2025-05-07T20:32:29.4436516Z compiled=True, 2025-05-07T20:32:29.4436619Z ) 2025-05-07T20:32:29.4436906Z self = 2025-05-07T20:32:29.4437111Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4437159Z 2025-05-07T20:32:29.4437369Z @given( 2025-05-07T20:32:29.4437521Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4437651Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4437830Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4438042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4438186Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4443116Z ) 2025-05-07T20:32:29.4443394Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4443491Z def test_silu_mul_quant( 2025-05-07T20:32:29.4443575Z self, 2025-05-07T20:32:29.4443652Z T: int, 2025-05-07T20:32:29.4443734Z D: int, 2025-05-07T20:32:29.4443838Z scale_ub: Optional[float], 2025-05-07T20:32:29.4443930Z contiguous: bool, 2025-05-07T20:32:29.4444027Z compiled: bool, 2025-05-07T20:32:29.4444113Z ) -> None: 2025-05-07T20:32:29.4444208Z torch.manual_seed(2025) 2025-05-07T20:32:29.4444292Z 2025-05-07T20:32:29.4444472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4444546Z 2025-05-07T20:32:29.4444648Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4444776Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4444874Z x = x_sign * x_clamp 2025-05-07T20:32:29.4444961Z x0 = x[:, :D] 2025-05-07T20:32:29.4445041Z x1 = x[:, D:] 2025-05-07T20:32:29.4445113Z 2025-05-07T20:32:29.4445203Z if contiguous: 2025-05-07T20:32:29.4445295Z x0 = x0.contiguous() 2025-05-07T20:32:29.4445390Z x1 = x1.contiguous() 2025-05-07T20:32:29.4445464Z 2025-05-07T20:32:29.4445555Z if scale_ub is not None: 2025-05-07T20:32:29.4445668Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4445805Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4445884Z ) 2025-05-07T20:32:29.4445966Z else: 2025-05-07T20:32:29.4446061Z scale_ub_tensor = None 2025-05-07T20:32:29.4446134Z 2025-05-07T20:32:29.4446276Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4446368Z op = silu_mul_quant 2025-05-07T20:32:29.4446456Z if compiled: 2025-05-07T20:32:29.4446572Z op = torch.compile(op) 2025-05-07T20:32:29.4446680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4446759Z 2025-05-07T20:32:29.4446851Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4446856Z 2025-05-07T20:32:29.4446955Z moe/activation_test.py:117: 2025-05-07T20:32:29.4447094Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4447196Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4447296Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4447686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4447783Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4448299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4448398Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4448872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4449107Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4449459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4449555Z kernel = self.compile( 2025-05-07T20:32:29.4449957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4450139Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4450315Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4450320Z 2025-05-07T20:32:29.4450531Z self = 2025-05-07T20:32:29.4451339Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4452702Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a47060>} 2025-05-07T20:32:29.4453477Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4453686Z context = 2025-05-07T20:32:29.4453693Z 2025-05-07T20:32:29.4453864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4454143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4454259Z module_map=module_map) 2025-05-07T20:32:29.4454433Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4454541Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4454621Z E ^ 2025-05-07T20:32:29.4454988Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4454993Z 2025-05-07T20:32:29.4455430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4455434Z 2025-05-07T20:32:29.4455539Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4455777Z self=, 2025-05-07T20:32:29.4455855Z T=1, 2025-05-07T20:32:29.4455932Z D=5120, 2025-05-07T20:32:29.4456019Z scale_ub=None, 2025-05-07T20:32:29.4456109Z contiguous=False, 2025-05-07T20:32:29.4456196Z compiled=True, 2025-05-07T20:32:29.4456281Z ) 2025-05-07T20:32:29.4456512Z self = 2025-05-07T20:32:29.4456682Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4456690Z 2025-05-07T20:32:29.4456768Z @given( 2025-05-07T20:32:29.4456893Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4457003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4457120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4457239Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4457363Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4457441Z ) 2025-05-07T20:32:29.4457700Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4457801Z def test_silu_mul_quant( 2025-05-07T20:32:29.4457880Z self, 2025-05-07T20:32:29.4457963Z T: int, 2025-05-07T20:32:29.4458041Z D: int, 2025-05-07T20:32:29.4458144Z scale_ub: Optional[float], 2025-05-07T20:32:29.4458327Z contiguous: bool, 2025-05-07T20:32:29.4458420Z compiled: bool, 2025-05-07T20:32:29.4458499Z ) -> None: 2025-05-07T20:32:29.4458603Z torch.manual_seed(2025) 2025-05-07T20:32:29.4458677Z 2025-05-07T20:32:29.4458852Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4458934Z 2025-05-07T20:32:29.4459030Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4459157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4459253Z x = x_sign * x_clamp 2025-05-07T20:32:29.4459332Z x0 = x[:, :D] 2025-05-07T20:32:29.4459457Z x1 = x[:, D:] 2025-05-07T20:32:29.4459535Z 2025-05-07T20:32:29.4459620Z if contiguous: 2025-05-07T20:32:29.4459716Z x0 = x0.contiguous() 2025-05-07T20:32:29.4459807Z x1 = x1.contiguous() 2025-05-07T20:32:29.4459881Z 2025-05-07T20:32:29.4460023Z if scale_ub is not None: 2025-05-07T20:32:29.4460136Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4460275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4460356Z ) 2025-05-07T20:32:29.4460434Z else: 2025-05-07T20:32:29.4460531Z scale_ub_tensor = None 2025-05-07T20:32:29.4460611Z 2025-05-07T20:32:29.4460744Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4460834Z op = silu_mul_quant 2025-05-07T20:32:29.4460925Z if compiled: 2025-05-07T20:32:29.4461025Z op = torch.compile(op) 2025-05-07T20:32:29.4461138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4461213Z 2025-05-07T20:32:29.4461305Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4461432Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4461504Z 2025-05-07T20:32:29.4461641Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4461751Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4461857Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4461981Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4462131Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4462205Z 2025-05-07T20:32:29.4462305Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4462315Z 2025-05-07T20:32:29.4462413Z moe/activation_test.py:126: 2025-05-07T20:32:29.4462543Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4462655Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4462795Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4463368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4463476Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4463855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4464090Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4464470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4464739Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4465131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4465306Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4465660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4465743Z fn() 2025-05-07T20:32:29.4466157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4466402Z self.fn.run( 2025-05-07T20:32:29.4466762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4466860Z kernel = self.compile( 2025-05-07T20:32:29.4467259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4467438Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4467567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4467617Z 2025-05-07T20:32:29.4467825Z self = 2025-05-07T20:32:29.4468627Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4469196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f95a6a20>} 2025-05-07T20:32:29.4469964Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4470164Z context = 2025-05-07T20:32:29.4470168Z 2025-05-07T20:32:29.4470338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4470613Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4470728Z module_map=module_map) 2025-05-07T20:32:29.4470891Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4470998Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4471076Z E ^ 2025-05-07T20:32:29.4471444Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4471449Z 2025-05-07T20:32:29.4471880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4471885Z 2025-05-07T20:32:29.4471989Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4472218Z self=, 2025-05-07T20:32:29.4472300Z T=1, 2025-05-07T20:32:29.4472379Z D=5120, 2025-05-07T20:32:29.4472469Z scale_ub=None, 2025-05-07T20:32:29.4472553Z contiguous=True, 2025-05-07T20:32:29.4472637Z compiled=False, 2025-05-07T20:32:29.4472715Z ) 2025-05-07T20:32:29.4472937Z self = 2025-05-07T20:32:29.4473104Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.4473112Z 2025-05-07T20:32:29.4473196Z @given( 2025-05-07T20:32:29.4473319Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4473423Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4473542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4473661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4473779Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4473853Z ) 2025-05-07T20:32:29.4474106Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4474208Z def test_silu_mul_quant( 2025-05-07T20:32:29.4474285Z self, 2025-05-07T20:32:29.4474362Z T: int, 2025-05-07T20:32:29.4474448Z D: int, 2025-05-07T20:32:29.4474547Z scale_ub: Optional[float], 2025-05-07T20:32:29.4474637Z contiguous: bool, 2025-05-07T20:32:29.4474727Z compiled: bool, 2025-05-07T20:32:29.4474807Z ) -> None: 2025-05-07T20:32:29.4474980Z torch.manual_seed(2025) 2025-05-07T20:32:29.4475059Z 2025-05-07T20:32:29.4475235Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4475318Z 2025-05-07T20:32:29.4475414Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4475544Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4475635Z x = x_sign * x_clamp 2025-05-07T20:32:29.4475719Z x0 = x[:, :D] 2025-05-07T20:32:29.4475802Z x1 = x[:, D:] 2025-05-07T20:32:29.4475878Z 2025-05-07T20:32:29.4475961Z if contiguous: 2025-05-07T20:32:29.4476094Z x0 = x0.contiguous() 2025-05-07T20:32:29.4476193Z x1 = x1.contiguous() 2025-05-07T20:32:29.4476262Z 2025-05-07T20:32:29.4476355Z if scale_ub is not None: 2025-05-07T20:32:29.4476467Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4476603Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4476724Z ) 2025-05-07T20:32:29.4476808Z else: 2025-05-07T20:32:29.4476904Z scale_ub_tensor = None 2025-05-07T20:32:29.4476978Z 2025-05-07T20:32:29.4477107Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4477196Z op = silu_mul_quant 2025-05-07T20:32:29.4477281Z if compiled: 2025-05-07T20:32:29.4477382Z op = torch.compile(op) 2025-05-07T20:32:29.4477489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4477568Z 2025-05-07T20:32:29.4477659Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4477665Z 2025-05-07T20:32:29.4477763Z moe/activation_test.py:117: 2025-05-07T20:32:29.4477894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4477994Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4478097Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4478617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4478716Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4479091Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4479322Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4479673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4479773Z kernel = self.compile( 2025-05-07T20:32:29.4480213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4480400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4480529Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4480537Z 2025-05-07T20:32:29.4480745Z self = 2025-05-07T20:32:29.4481554Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4482075Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328f2b8860>} 2025-05-07T20:32:29.4482851Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4483047Z context = 2025-05-07T20:32:29.4483051Z 2025-05-07T20:32:29.4483223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4483575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4483685Z module_map=module_map) 2025-05-07T20:32:29.4483854Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4483953Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4484028Z E ^ 2025-05-07T20:32:29.4484399Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4484404Z 2025-05-07T20:32:29.4484831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4484874Z 2025-05-07T20:32:29.4484987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4485216Z self=, 2025-05-07T20:32:29.4485296Z T=128, 2025-05-07T20:32:29.4485416Z D=5120, 2025-05-07T20:32:29.4485498Z scale_ub=None, 2025-05-07T20:32:29.4485589Z contiguous=False, 2025-05-07T20:32:29.4485676Z compiled=True, 2025-05-07T20:32:29.4485750Z ) 2025-05-07T20:32:29.4485975Z self = 2025-05-07T20:32:29.4486156Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4486161Z 2025-05-07T20:32:29.4486239Z @given( 2025-05-07T20:32:29.4486365Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4486466Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4486581Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4486706Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4486818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4486891Z ) 2025-05-07T20:32:29.4487145Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4487240Z def test_silu_mul_quant( 2025-05-07T20:32:29.4487322Z self, 2025-05-07T20:32:29.4487402Z T: int, 2025-05-07T20:32:29.4487477Z D: int, 2025-05-07T20:32:29.4487577Z scale_ub: Optional[float], 2025-05-07T20:32:29.4487668Z contiguous: bool, 2025-05-07T20:32:29.4487753Z compiled: bool, 2025-05-07T20:32:29.4487837Z ) -> None: 2025-05-07T20:32:29.4487930Z torch.manual_seed(2025) 2025-05-07T20:32:29.4488003Z 2025-05-07T20:32:29.4488179Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4488252Z 2025-05-07T20:32:29.4488345Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4488478Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4488566Z x = x_sign * x_clamp 2025-05-07T20:32:29.4488645Z x0 = x[:, :D] 2025-05-07T20:32:29.4488726Z x1 = x[:, D:] 2025-05-07T20:32:29.4488799Z 2025-05-07T20:32:29.4488886Z if contiguous: 2025-05-07T20:32:29.4488982Z x0 = x0.contiguous() 2025-05-07T20:32:29.4489075Z x1 = x1.contiguous() 2025-05-07T20:32:29.4489153Z 2025-05-07T20:32:29.4489244Z if scale_ub is not None: 2025-05-07T20:32:29.4489351Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4489488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4489565Z ) 2025-05-07T20:32:29.4489639Z else: 2025-05-07T20:32:29.4489739Z scale_ub_tensor = None 2025-05-07T20:32:29.4489812Z 2025-05-07T20:32:29.4489943Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4490038Z op = silu_mul_quant 2025-05-07T20:32:29.4490126Z if compiled: 2025-05-07T20:32:29.4490228Z op = torch.compile(op) 2025-05-07T20:32:29.4490335Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4490408Z 2025-05-07T20:32:29.4490502Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4490509Z 2025-05-07T20:32:29.4490606Z moe/activation_test.py:117: 2025-05-07T20:32:29.4490815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4490925Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4491024Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4491406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4491499Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4492010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4492155Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4492523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4492755Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4493155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4493250Z kernel = self.compile( 2025-05-07T20:32:29.4493648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4493833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4493961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4493965Z 2025-05-07T20:32:29.4494179Z self = 2025-05-07T20:32:29.4494980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4495505Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9b8df80>} 2025-05-07T20:32:29.4496285Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4496483Z context = 2025-05-07T20:32:29.4496488Z 2025-05-07T20:32:29.4496657Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4496926Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4497040Z module_map=module_map) 2025-05-07T20:32:29.4497204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4497302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4497378Z E ^ 2025-05-07T20:32:29.4497741Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4497753Z 2025-05-07T20:32:29.4498186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4498191Z 2025-05-07T20:32:29.4498296Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4498524Z self=, 2025-05-07T20:32:29.4498605Z T=128, 2025-05-07T20:32:29.4498679Z D=7168, 2025-05-07T20:32:29.4498761Z scale_ub=1200.0, 2025-05-07T20:32:29.4498851Z contiguous=False, 2025-05-07T20:32:29.4498937Z compiled=False, 2025-05-07T20:32:29.4499011Z ) 2025-05-07T20:32:29.4499234Z self = 2025-05-07T20:32:29.4499413Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4499417Z 2025-05-07T20:32:29.4499498Z @given( 2025-05-07T20:32:29.4499623Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4499824Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4499947Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4500065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4500178Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4500254Z ) 2025-05-07T20:32:29.4500507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4500605Z def test_silu_mul_quant( 2025-05-07T20:32:29.4500681Z self, 2025-05-07T20:32:29.4500757Z T: int, 2025-05-07T20:32:29.4500880Z D: int, 2025-05-07T20:32:29.4500979Z scale_ub: Optional[float], 2025-05-07T20:32:29.4501067Z contiguous: bool, 2025-05-07T20:32:29.4501157Z compiled: bool, 2025-05-07T20:32:29.4501233Z ) -> None: 2025-05-07T20:32:29.4501327Z torch.manual_seed(2025) 2025-05-07T20:32:29.4501441Z 2025-05-07T20:32:29.4501621Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4501694Z 2025-05-07T20:32:29.4501791Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4501917Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4502009Z x = x_sign * x_clamp 2025-05-07T20:32:29.4502088Z x0 = x[:, :D] 2025-05-07T20:32:29.4502167Z x1 = x[:, D:] 2025-05-07T20:32:29.4502249Z 2025-05-07T20:32:29.4502334Z if contiguous: 2025-05-07T20:32:29.4502431Z x0 = x0.contiguous() 2025-05-07T20:32:29.4502519Z x1 = x1.contiguous() 2025-05-07T20:32:29.4502594Z 2025-05-07T20:32:29.4502691Z if scale_ub is not None: 2025-05-07T20:32:29.4502796Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4502931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4503011Z ) 2025-05-07T20:32:29.4503088Z else: 2025-05-07T20:32:29.4503188Z scale_ub_tensor = None 2025-05-07T20:32:29.4503261Z 2025-05-07T20:32:29.4503398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4503493Z op = silu_mul_quant 2025-05-07T20:32:29.4503577Z if compiled: 2025-05-07T20:32:29.4503676Z op = torch.compile(op) 2025-05-07T20:32:29.4503785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4503857Z 2025-05-07T20:32:29.4503947Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4503951Z 2025-05-07T20:32:29.4504054Z moe/activation_test.py:117: 2025-05-07T20:32:29.4504181Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4504283Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4504384Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4504896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4505000Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4505374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4505605Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4505964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4506059Z kernel = self.compile( 2025-05-07T20:32:29.4506458Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4506638Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4506766Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4506771Z 2025-05-07T20:32:29.4506983Z self = 2025-05-07T20:32:29.4507866Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4508394Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328f033060>} 2025-05-07T20:32:29.4509165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4509397Z context = 2025-05-07T20:32:29.4509402Z 2025-05-07T20:32:29.4509573Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4509844Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4510009Z module_map=module_map) 2025-05-07T20:32:29.4510176Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4510275Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4510355Z E ^ 2025-05-07T20:32:29.4510716Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4510721Z 2025-05-07T20:32:29.4511148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4511153Z 2025-05-07T20:32:29.4511257Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4511487Z self=, 2025-05-07T20:32:29.4511568Z T=128, 2025-05-07T20:32:29.4511643Z D=5120, 2025-05-07T20:32:29.4511725Z scale_ub=None, 2025-05-07T20:32:29.4511815Z contiguous=False, 2025-05-07T20:32:29.4511901Z compiled=False, 2025-05-07T20:32:29.4511973Z ) 2025-05-07T20:32:29.4512203Z self = 2025-05-07T20:32:29.4512376Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.4512381Z 2025-05-07T20:32:29.4512460Z @given( 2025-05-07T20:32:29.4512578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4512677Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4512798Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4512915Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4513032Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4513113Z ) 2025-05-07T20:32:29.4513612Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4513757Z def test_silu_mul_quant( 2025-05-07T20:32:29.4513851Z self, 2025-05-07T20:32:29.4513930Z T: int, 2025-05-07T20:32:29.4514013Z D: int, 2025-05-07T20:32:29.4514117Z scale_ub: Optional[float], 2025-05-07T20:32:29.4514206Z contiguous: bool, 2025-05-07T20:32:29.4514296Z compiled: bool, 2025-05-07T20:32:29.4514374Z ) -> None: 2025-05-07T20:32:29.4514470Z torch.manual_seed(2025) 2025-05-07T20:32:29.4514544Z 2025-05-07T20:32:29.4514717Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4514789Z 2025-05-07T20:32:29.4514883Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4515008Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4515096Z x = x_sign * x_clamp 2025-05-07T20:32:29.4515181Z x0 = x[:, :D] 2025-05-07T20:32:29.4515263Z x1 = x[:, D:] 2025-05-07T20:32:29.4515334Z 2025-05-07T20:32:29.4515421Z if contiguous: 2025-05-07T20:32:29.4515513Z x0 = x0.contiguous() 2025-05-07T20:32:29.4515607Z x1 = x1.contiguous() 2025-05-07T20:32:29.4515681Z 2025-05-07T20:32:29.4515770Z if scale_ub is not None: 2025-05-07T20:32:29.4516028Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4516168Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4516243Z ) 2025-05-07T20:32:29.4516320Z else: 2025-05-07T20:32:29.4516414Z scale_ub_tensor = None 2025-05-07T20:32:29.4516486Z 2025-05-07T20:32:29.4516621Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4516711Z op = silu_mul_quant 2025-05-07T20:32:29.4516795Z if compiled: 2025-05-07T20:32:29.4516898Z op = torch.compile(op) 2025-05-07T20:32:29.4517062Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4517137Z 2025-05-07T20:32:29.4517226Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4517230Z 2025-05-07T20:32:29.4517326Z moe/activation_test.py:117: 2025-05-07T20:32:29.4517459Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4517623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4517723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4518243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4518340Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4518714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4518945Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4519297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4519397Z kernel = self.compile( 2025-05-07T20:32:29.4519793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4519973Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4520168Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4520173Z 2025-05-07T20:32:29.4520381Z self = 2025-05-07T20:32:29.4521190Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4521710Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e1872e0>} 2025-05-07T20:32:29.4522491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4522693Z context = 2025-05-07T20:32:29.4522698Z 2025-05-07T20:32:29.4522864Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4523138Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4523244Z module_map=module_map) 2025-05-07T20:32:29.4523409Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4523509Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4523585Z E ^ 2025-05-07T20:32:29.4523980Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4523990Z 2025-05-07T20:32:29.4524440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4524445Z 2025-05-07T20:32:29.4524550Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4524864Z self=, 2025-05-07T20:32:29.4524942Z T=128, 2025-05-07T20:32:29.4525022Z D=5120, 2025-05-07T20:32:29.4525105Z scale_ub=1200.0, 2025-05-07T20:32:29.4525193Z contiguous=True, 2025-05-07T20:32:29.4525280Z compiled=False, 2025-05-07T20:32:29.4525350Z ) 2025-05-07T20:32:29.4525572Z self = 2025-05-07T20:32:29.4525751Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.4525755Z 2025-05-07T20:32:29.4525832Z @given( 2025-05-07T20:32:29.4525990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4526093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4526208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4526327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4526505Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4526577Z ) 2025-05-07T20:32:29.4526838Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4526931Z def test_silu_mul_quant( 2025-05-07T20:32:29.4527006Z self, 2025-05-07T20:32:29.4527087Z T: int, 2025-05-07T20:32:29.4527163Z D: int, 2025-05-07T20:32:29.4527260Z scale_ub: Optional[float], 2025-05-07T20:32:29.4527356Z contiguous: bool, 2025-05-07T20:32:29.4527440Z compiled: bool, 2025-05-07T20:32:29.4527516Z ) -> None: 2025-05-07T20:32:29.4527617Z torch.manual_seed(2025) 2025-05-07T20:32:29.4527691Z 2025-05-07T20:32:29.4527869Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4527942Z 2025-05-07T20:32:29.4528033Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4528162Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4528248Z x = x_sign * x_clamp 2025-05-07T20:32:29.4528332Z x0 = x[:, :D] 2025-05-07T20:32:29.4528420Z x1 = x[:, D:] 2025-05-07T20:32:29.4528492Z 2025-05-07T20:32:29.4528575Z if contiguous: 2025-05-07T20:32:29.4528669Z x0 = x0.contiguous() 2025-05-07T20:32:29.4528757Z x1 = x1.contiguous() 2025-05-07T20:32:29.4528827Z 2025-05-07T20:32:29.4528919Z if scale_ub is not None: 2025-05-07T20:32:29.4529023Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4529161Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4529236Z ) 2025-05-07T20:32:29.4529310Z else: 2025-05-07T20:32:29.4529409Z scale_ub_tensor = None 2025-05-07T20:32:29.4529480Z 2025-05-07T20:32:29.4529611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4529701Z op = silu_mul_quant 2025-05-07T20:32:29.4529785Z if compiled: 2025-05-07T20:32:29.4529883Z op = torch.compile(op) 2025-05-07T20:32:29.4529994Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4530070Z 2025-05-07T20:32:29.4530161Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4530165Z 2025-05-07T20:32:29.4530264Z moe/activation_test.py:117: 2025-05-07T20:32:29.4530392Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4530498Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4530596Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4531107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4531212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4531582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4531810Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4532247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4532343Z kernel = self.compile( 2025-05-07T20:32:29.4532742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4532921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4533049Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4533053Z 2025-05-07T20:32:29.4533266Z self = 2025-05-07T20:32:29.4534103Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4534630Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e186ac0>} 2025-05-07T20:32:29.4535435Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4535629Z context = 2025-05-07T20:32:29.4535638Z 2025-05-07T20:32:29.4535804Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4536073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4536188Z module_map=module_map) 2025-05-07T20:32:29.4536349Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4536445Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4536528Z E ^ 2025-05-07T20:32:29.4536900Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4536907Z 2025-05-07T20:32:29.4537336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4537340Z 2025-05-07T20:32:29.4537444Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4537671Z self=, 2025-05-07T20:32:29.4537755Z T=1, 2025-05-07T20:32:29.4537829Z D=7168, 2025-05-07T20:32:29.4537910Z scale_ub=1200.0, 2025-05-07T20:32:29.4538002Z contiguous=True, 2025-05-07T20:32:29.4538086Z compiled=True, 2025-05-07T20:32:29.4538158Z ) 2025-05-07T20:32:29.4538385Z self = 2025-05-07T20:32:29.4538553Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4538557Z 2025-05-07T20:32:29.4538635Z @given( 2025-05-07T20:32:29.4538757Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4538860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4538978Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4539095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4539209Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4539288Z ) 2025-05-07T20:32:29.4539539Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4539637Z def test_silu_mul_quant( 2025-05-07T20:32:29.4539711Z self, 2025-05-07T20:32:29.4539785Z T: int, 2025-05-07T20:32:29.4539868Z D: int, 2025-05-07T20:32:29.4539965Z scale_ub: Optional[float], 2025-05-07T20:32:29.4540053Z contiguous: bool, 2025-05-07T20:32:29.4540142Z compiled: bool, 2025-05-07T20:32:29.4540218Z ) -> None: 2025-05-07T20:32:29.4540313Z torch.manual_seed(2025) 2025-05-07T20:32:29.4540390Z 2025-05-07T20:32:29.4540640Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4540713Z 2025-05-07T20:32:29.4540806Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4540931Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4541017Z x = x_sign * x_clamp 2025-05-07T20:32:29.4541098Z x0 = x[:, :D] 2025-05-07T20:32:29.4541175Z x1 = x[:, D:] 2025-05-07T20:32:29.4541248Z 2025-05-07T20:32:29.4541331Z if contiguous: 2025-05-07T20:32:29.4541423Z x0 = x0.contiguous() 2025-05-07T20:32:29.4541515Z x1 = x1.contiguous() 2025-05-07T20:32:29.4541625Z 2025-05-07T20:32:29.4541716Z if scale_ub is not None: 2025-05-07T20:32:29.4541824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4541959Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4542032Z ) 2025-05-07T20:32:29.4542111Z else: 2025-05-07T20:32:29.4542247Z scale_ub_tensor = None 2025-05-07T20:32:29.4542318Z 2025-05-07T20:32:29.4542459Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4542549Z op = silu_mul_quant 2025-05-07T20:32:29.4542638Z if compiled: 2025-05-07T20:32:29.4542737Z op = torch.compile(op) 2025-05-07T20:32:29.4542843Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4542919Z 2025-05-07T20:32:29.4543009Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4543013Z 2025-05-07T20:32:29.4543108Z moe/activation_test.py:117: 2025-05-07T20:32:29.4543236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4543340Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4543439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4543819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4543915Z return fn(*args, **kwargs) 
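[annotation] The recurring CompilationError above is an architecture mismatch rather than a numerical bug: Triton's fp8e4nv element type (PyTorch's torch.float8_e4m3fn) only compiles on NVIDIA GPUs at compute capability 8.9 (Ada) or newer, and on this runner's GPU the Triton backend exposes only the fp8e4b15 and fp8e5 formats, exactly as the ValueError reports. A minimal capability probe, as a sketch (the function name is illustrative, not part of FBGEMM's API):

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv kernels compile only on Ada (SM 8.9) and Hopper (SM 9.0+).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)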
2025-05-07T20:32:29.4544433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4544529Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4544895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4545126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4545476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4545573Z kernel = self.compile( 2025-05-07T20:32:29.4545971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4546148Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4546280Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4546287Z 2025-05-07T20:32:29.4546498Z self = 2025-05-07T20:32:29.4547301Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4547823Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328eaac680>} 2025-05-07T20:32:29.4548597Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4548797Z context = 2025-05-07T20:32:29.4548804Z 2025-05-07T20:32:29.4548969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4549326Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4549437Z module_map=module_map) 2025-05-07T20:32:29.4549599Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4549698Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4549775Z E ^ 2025-05-07T20:32:29.4550137Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4550142Z 2025-05-07T20:32:29.4550608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4550612Z 2025-05-07T20:32:29.4550716Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4550946Z self=, 2025-05-07T20:32:29.4551060Z T=1, 2025-05-07T20:32:29.4551137Z D=7168, 2025-05-07T20:32:29.4551228Z scale_ub=1200.0, 2025-05-07T20:32:29.4551314Z contiguous=False, 2025-05-07T20:32:29.4551394Z compiled=True, 2025-05-07T20:32:29.4551470Z ) 2025-05-07T20:32:29.4551692Z self = 2025-05-07T20:32:29.4551862Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4551871Z 2025-05-07T20:32:29.4551946Z @given( 2025-05-07T20:32:29.4552064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4552165Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4552281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4552396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4552512Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4552583Z ) 2025-05-07T20:32:29.4552833Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4552938Z def test_silu_mul_quant( 2025-05-07T20:32:29.4553013Z self, 2025-05-07T20:32:29.4553093Z T: int, 2025-05-07T20:32:29.4553167Z D: int, 2025-05-07T20:32:29.4553266Z scale_ub: Optional[float], 2025-05-07T20:32:29.4553356Z contiguous: bool, 2025-05-07T20:32:29.4553443Z compiled: bool, 2025-05-07T20:32:29.4553519Z ) -> None: 2025-05-07T20:32:29.4553617Z torch.manual_seed(2025) 2025-05-07T20:32:29.4553689Z 2025-05-07T20:32:29.4553861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4553944Z 2025-05-07T20:32:29.4554055Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4554196Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4554294Z x = x_sign * x_clamp 2025-05-07T20:32:29.4554374Z x0 = x[:, :D] 2025-05-07T20:32:29.4554453Z x1 = x[:, D:] 2025-05-07T20:32:29.4554531Z 2025-05-07T20:32:29.4554614Z if contiguous: 2025-05-07T20:32:29.4554711Z x0 = x0.contiguous() 2025-05-07T20:32:29.4554799Z x1 = x1.contiguous() 2025-05-07T20:32:29.4554870Z 2025-05-07T20:32:29.4554963Z if scale_ub is not None: 2025-05-07T20:32:29.4555069Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4555204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4555281Z ) 2025-05-07T20:32:29.4555357Z else: 2025-05-07T20:32:29.4555450Z scale_ub_tensor = None 2025-05-07T20:32:29.4555525Z 2025-05-07T20:32:29.4555654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4555746Z op = silu_mul_quant 2025-05-07T20:32:29.4555832Z if compiled: 2025-05-07T20:32:29.4555930Z op = torch.compile(op) 2025-05-07T20:32:29.4556036Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4556106Z 2025-05-07T20:32:29.4556197Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4556202Z 2025-05-07T20:32:29.4556383Z moe/activation_test.py:117: 2025-05-07T20:32:29.4556515Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4556616Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4556720Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4557093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4557186Z return fn(*args, **kwargs) 
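[annotation] For reference, the op under test fuses a SiLU-gated product with row-wise fp8 quantization; the test's own ref_fn pins the semantics down as y = x0 * sigmoid(x0) * x1 followed by triton_quantize_fp8_row. A rough eager-mode sketch of that math, assuming torch.float8_e4m3fn is available and using 448.0 as that format's finite maximum; this illustrates the computation only, and the exact scale handling inside triton_quantize_fp8_row may differ:

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
    fp8_max: float = 448.0,  # finite max of float8_e4m3fn
) -> tuple[torch.Tensor, torch.Tensor]:
    # SiLU-gated product in fp32, matching the test's ref_fn.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Per-row dequantization scale; the test dequantizes with
    # y_fp8.to(torch.float32) * y_scale[:, None].
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    y_scale = row_max / fp8_max
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale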
2025-05-07T20:32:29.4557703Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4557864Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4558240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4558469Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4558863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4558960Z kernel = self.compile( 2025-05-07T20:32:29.4559354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4559541Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4559669Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4559673Z 2025-05-07T20:32:29.4559880Z self = 2025-05-07T20:32:29.4560746Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4561272Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e737240>} 2025-05-07T20:32:29.4562044Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4562236Z context = 2025-05-07T20:32:29.4562241Z 2025-05-07T20:32:29.4562406Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4562680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4562788Z module_map=module_map) 2025-05-07T20:32:29.4562955Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4563052Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4563130Z E ^ 2025-05-07T20:32:29.4563501Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4563505Z 2025-05-07T20:32:29.4563956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4563960Z 2025-05-07T20:32:29.4564082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4564315Z self=, 2025-05-07T20:32:29.4564392Z T=1, 2025-05-07T20:32:29.4564474Z D=7168, 2025-05-07T20:32:29.4567836Z scale_ub=None, 2025-05-07T20:32:29.4567948Z contiguous=False, 2025-05-07T20:32:29.4568033Z compiled=True, 2025-05-07T20:32:29.4568107Z ) 2025-05-07T20:32:29.4568338Z self = 2025-05-07T20:32:29.4568505Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4568513Z 2025-05-07T20:32:29.4568596Z @given( 2025-05-07T20:32:29.4568816Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4568918Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4569035Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4569149Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4569263Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4569341Z ) 2025-05-07T20:32:29.4569596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4569695Z def test_silu_mul_quant( 2025-05-07T20:32:29.4569835Z self, 2025-05-07T20:32:29.4569911Z T: int, 2025-05-07T20:32:29.4569987Z D: int, 2025-05-07T20:32:29.4570083Z scale_ub: Optional[float], 2025-05-07T20:32:29.4570171Z contiguous: bool, 2025-05-07T20:32:29.4570259Z compiled: bool, 2025-05-07T20:32:29.4570337Z ) -> None: 2025-05-07T20:32:29.4570473Z torch.manual_seed(2025) 2025-05-07T20:32:29.4570548Z 2025-05-07T20:32:29.4570724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4570798Z 2025-05-07T20:32:29.4570893Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4571020Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4571111Z x = x_sign * x_clamp 2025-05-07T20:32:29.4571190Z x0 = x[:, :D] 2025-05-07T20:32:29.4571267Z x1 = x[:, D:] 2025-05-07T20:32:29.4571342Z 2025-05-07T20:32:29.4571426Z if contiguous: 2025-05-07T20:32:29.4571515Z x0 = x0.contiguous() 2025-05-07T20:32:29.4571609Z x1 = x1.contiguous() 2025-05-07T20:32:29.4571680Z 2025-05-07T20:32:29.4571769Z if scale_ub is not None: 2025-05-07T20:32:29.4571879Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4572014Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4572089Z ) 2025-05-07T20:32:29.4572173Z else: 2025-05-07T20:32:29.4572269Z scale_ub_tensor = None 2025-05-07T20:32:29.4572340Z 2025-05-07T20:32:29.4572475Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4572564Z op = silu_mul_quant 2025-05-07T20:32:29.4572655Z if compiled: 2025-05-07T20:32:29.4572754Z op = torch.compile(op) 2025-05-07T20:32:29.4572858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4572931Z 2025-05-07T20:32:29.4573020Z y_fp8, y_scale = fn() 2025-05-07T20:32:29.4573139Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:29.4573218Z 2025-05-07T20:32:29.4573354Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4573456Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:29.4573558Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:29.4573681Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:29.4573828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4573903Z 2025-05-07T20:32:29.4574003Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:29.4574007Z 2025-05-07T20:32:29.4574107Z moe/activation_test.py:126: 2025-05-07T20:32:29.4574236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4574340Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:29.4574477Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:29.4575052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:29.4575156Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:29.4575524Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4575750Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4576213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:29.4576477Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:29.4576864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:29.4577037Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:29.4577387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:29.4577503Z fn() 2025-05-07T20:32:29.4577913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:29.4577996Z self.fn.run( 2025-05-07T20:32:29.4578348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4578479Z kernel = self.compile( 2025-05-07T20:32:29.4578876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4579059Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4579185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4579190Z 2025-05-07T20:32:29.4579400Z self = 2025-05-07T20:32:29.4580198Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4580717Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f328e736980>} 2025-05-07T20:32:29.4581490Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4581685Z context = 2025-05-07T20:32:29.4581689Z 2025-05-07T20:32:29.4581858Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4582128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4582238Z module_map=module_map) 2025-05-07T20:32:29.4582403Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4582505Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:29.4582586Z E ^ 2025-05-07T20:32:29.4582947Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4582955Z 2025-05-07T20:32:29.4583385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4583395Z 2025-05-07T20:32:29.4583499Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4583726Z self=, 2025-05-07T20:32:29.4583806Z T=1, 2025-05-07T20:32:29.4583881Z D=5120, 2025-05-07T20:32:29.4583964Z scale_ub=1200.0, 2025-05-07T20:32:29.4584051Z contiguous=False, 2025-05-07T20:32:29.4584133Z compiled=True, 2025-05-07T20:32:29.4584203Z ) 2025-05-07T20:32:29.4584429Z self = 2025-05-07T20:32:29.4584600Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4584605Z 2025-05-07T20:32:29.4584684Z @given( 2025-05-07T20:32:29.4584803Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4584904Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4585100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4585219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4585335Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4585410Z ) 2025-05-07T20:32:29.4585661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4585754Z def test_silu_mul_quant( 2025-05-07T20:32:29.4585833Z self, 2025-05-07T20:32:29.4585908Z T: int, 2025-05-07T20:32:29.4585982Z D: int, 2025-05-07T20:32:29.4586083Z scale_ub: Optional[float], 2025-05-07T20:32:29.4586210Z contiguous: bool, 2025-05-07T20:32:29.4586300Z compiled: bool, 2025-05-07T20:32:29.4586378Z ) -> None: 2025-05-07T20:32:29.4586474Z torch.manual_seed(2025) 2025-05-07T20:32:29.4586549Z 2025-05-07T20:32:29.4586720Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4586834Z 2025-05-07T20:32:29.4586938Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4587063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4587152Z x = x_sign * x_clamp 2025-05-07T20:32:29.4587233Z x0 = x[:, :D] 2025-05-07T20:32:29.4587311Z x1 = x[:, D:] 2025-05-07T20:32:29.4587383Z 2025-05-07T20:32:29.4587467Z if contiguous: 2025-05-07T20:32:29.4587558Z x0 = x0.contiguous() 2025-05-07T20:32:29.4587650Z x1 = x1.contiguous() 2025-05-07T20:32:29.4587721Z 2025-05-07T20:32:29.4587809Z if scale_ub is not None: 2025-05-07T20:32:29.4587918Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4588054Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4588126Z ) 2025-05-07T20:32:29.4588204Z else: 2025-05-07T20:32:29.4588296Z scale_ub_tensor = None 2025-05-07T20:32:29.4588368Z 2025-05-07T20:32:29.4588504Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4588597Z op = silu_mul_quant 2025-05-07T20:32:29.4588682Z if compiled: 2025-05-07T20:32:29.4588783Z op = torch.compile(op) 2025-05-07T20:32:29.4588889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4588958Z 2025-05-07T20:32:29.4589051Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4589055Z 2025-05-07T20:32:29.4589152Z moe/activation_test.py:117: 2025-05-07T20:32:29.4589283Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4589382Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4589487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4589862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4589954Z return fn(*args, **kwargs) 
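[annotation] Note that the T=1, D=7168, scale_ub=None, compiled=True draw above fails one step later than the others: fn() returns, and it is the reference path, triton_quantize_fp8_row, that raises the identical fp8e4nv error from inside the autotuner's benchmarking loop (autotuner.py -> do_bench -> compile). Both the fused kernel and the reference quantizer therefore need the same hardware, so gating the whole test on device capability is the natural fix. One way to express that, as a sketch (unittest-style decorator; names are illustrative):

import unittest
import torch

def _cuda_capability() -> tuple[int, int]:
    if not torch.cuda.is_available():
        return (0, 0)
    return torch.cuda.get_device_capability()

@unittest.skipIf(_cuda_capability() < (8, 9), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
class Fp8ActivationTest(unittest.TestCase):
    def test_smoke(self) -> None:
        # Real fp8 cases would live here; skipped wholesale on older GPUs.
        self.assertTrue(True)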
2025-05-07T20:32:29.4590463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4590567Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4590931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4591162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4591511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4591606Z kernel = self.compile( 2025-05-07T20:32:29.4592004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4592183Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4592312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4592316Z 2025-05-07T20:32:29.4592524Z self = 2025-05-07T20:32:29.4593426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4593947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a45d00>} 2025-05-07T20:32:29.4594715Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4594951Z context = 2025-05-07T20:32:29.4594955Z 2025-05-07T20:32:29.4595122Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4595434Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4595546Z module_map=module_map) 2025-05-07T20:32:29.4595708Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4595807Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4595882Z E ^ 2025-05-07T20:32:29.4596246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4596250Z 2025-05-07T20:32:29.4596678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4596685Z 2025-05-07T20:32:29.4596789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4597017Z self=, 2025-05-07T20:32:29.4597092Z T=1, 2025-05-07T20:32:29.4597165Z D=5120, 2025-05-07T20:32:29.4597255Z scale_ub=1200.0, 2025-05-07T20:32:29.4597340Z contiguous=False, 2025-05-07T20:32:29.4597428Z compiled=False, 2025-05-07T20:32:29.4597504Z ) 2025-05-07T20:32:29.4597724Z self = 2025-05-07T20:32:29.4597894Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4597898Z 2025-05-07T20:32:29.4597975Z @given( 2025-05-07T20:32:29.4598094Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4598196Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4598309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4598430Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4598546Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4598619Z ) 2025-05-07T20:32:29.4598870Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4598967Z def test_silu_mul_quant( 2025-05-07T20:32:29.4599046Z self, 2025-05-07T20:32:29.4599125Z T: int, 2025-05-07T20:32:29.4599203Z D: int, 2025-05-07T20:32:29.4599301Z scale_ub: Optional[float], 2025-05-07T20:32:29.4599389Z contiguous: bool, 2025-05-07T20:32:29.4599475Z compiled: bool, 2025-05-07T20:32:29.4599552Z ) -> None: 2025-05-07T20:32:29.4599647Z torch.manual_seed(2025) 2025-05-07T20:32:29.4599717Z 2025-05-07T20:32:29.4599888Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4599965Z 2025-05-07T20:32:29.4600057Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4600249Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4600340Z x = x_sign * x_clamp 2025-05-07T20:32:29.4600419Z x0 = x[:, :D] 2025-05-07T20:32:29.4600498Z x1 = x[:, D:] 2025-05-07T20:32:29.4600573Z 2025-05-07T20:32:29.4600655Z if contiguous: 2025-05-07T20:32:29.4600747Z x0 = x0.contiguous() 2025-05-07T20:32:29.4600839Z x1 = x1.contiguous() 2025-05-07T20:32:29.4601033Z 2025-05-07T20:32:29.4601126Z if scale_ub is not None: 2025-05-07T20:32:29.4601236Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4601372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4601448Z ) 2025-05-07T20:32:29.4601523Z else: 2025-05-07T20:32:29.4601618Z scale_ub_tensor = None 2025-05-07T20:32:29.4601690Z 2025-05-07T20:32:29.4601819Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4601908Z op = silu_mul_quant 2025-05-07T20:32:29.4602035Z if compiled: 2025-05-07T20:32:29.4602135Z op = torch.compile(op) 2025-05-07T20:32:29.4602240Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4602315Z 2025-05-07T20:32:29.4602403Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4602407Z 2025-05-07T20:32:29.4602546Z moe/activation_test.py:117: 2025-05-07T20:32:29.4602678Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4602777Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4602879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4603391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4603487Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4603860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4604089Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4604439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4604532Z kernel = self.compile( 2025-05-07T20:32:29.4604924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4605110Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4605236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4605241Z 2025-05-07T20:32:29.4605446Z self = 2025-05-07T20:32:29.4606247Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4606766Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3f240>} 2025-05-07T20:32:29.4607544Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4607742Z context = 2025-05-07T20:32:29.4607746Z 2025-05-07T20:32:29.4607917Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4608189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4608295Z module_map=module_map) 2025-05-07T20:32:29.4608462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4608561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4608639Z E ^ 2025-05-07T20:32:29.4609007Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4609012Z 2025-05-07T20:32:29.4609438Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4609444Z 2025-05-07T20:32:29.4609630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4609860Z self=, 2025-05-07T20:32:29.4609936Z T=16384, 2025-05-07T20:32:29.4610014Z D=5120, 2025-05-07T20:32:29.4610096Z scale_ub=1200.0, 2025-05-07T20:32:29.4610181Z contiguous=False, 2025-05-07T20:32:29.4610266Z compiled=True, 2025-05-07T20:32:29.4610337Z ) 2025-05-07T20:32:29.4610563Z self = 2025-05-07T20:32:29.4610745Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4610788Z 2025-05-07T20:32:29.4610863Z @given( 2025-05-07T20:32:29.4610986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4611085Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4611198Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4611360Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4611479Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4611550Z ) 2025-05-07T20:32:29.4611805Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4611897Z def test_silu_mul_quant( 2025-05-07T20:32:29.4611975Z self, 2025-05-07T20:32:29.4612049Z T: int, 2025-05-07T20:32:29.4612123Z D: int, 2025-05-07T20:32:29.4612221Z scale_ub: Optional[float], 2025-05-07T20:32:29.4612309Z contiguous: bool, 2025-05-07T20:32:29.4612395Z compiled: bool, 2025-05-07T20:32:29.4612478Z ) -> None: 2025-05-07T20:32:29.4612572Z torch.manual_seed(2025) 2025-05-07T20:32:29.4612643Z 2025-05-07T20:32:29.4612817Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4612889Z 2025-05-07T20:32:29.4612981Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4613112Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4613207Z x = x_sign * x_clamp 2025-05-07T20:32:29.4613288Z x0 = x[:, :D] 2025-05-07T20:32:29.4613617Z x1 = x[:, D:] 2025-05-07T20:32:29.4613724Z 2025-05-07T20:32:29.4613816Z if contiguous: 2025-05-07T20:32:29.4613908Z x0 = x0.contiguous() 2025-05-07T20:32:29.4613995Z x1 = x1.contiguous() 2025-05-07T20:32:29.4614069Z 2025-05-07T20:32:29.4614160Z if scale_ub is not None: 2025-05-07T20:32:29.4614265Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4614404Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4614483Z ) 2025-05-07T20:32:29.4614558Z else: 2025-05-07T20:32:29.4614654Z scale_ub_tensor = None 2025-05-07T20:32:29.4614726Z 2025-05-07T20:32:29.4614861Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4614952Z op = silu_mul_quant 2025-05-07T20:32:29.4615038Z if compiled: 2025-05-07T20:32:29.4615144Z op = torch.compile(op) 2025-05-07T20:32:29.4615250Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4615319Z 2025-05-07T20:32:29.4615410Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4615415Z 2025-05-07T20:32:29.4615513Z moe/activation_test.py:117: 2025-05-07T20:32:29.4615646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4615747Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4615846Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4616225Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4616320Z return fn(*args, **kwargs) 
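[annotation] Each "Trying example" block in this log is one draw by Hypothesis from the 5 x 2 x 2 x 2 x 2 = 80-combination grid defined by the sampled_from strategies; the draws are printed because the test runs at Verbosity.verbose, deadline=None disables Hypothesis's per-example time limit, and max_examples=_MAX_SAMPLES caps how many draws are attempted. To turn one logged draw into a deterministic regression case, hypothesis.example can pin it, as in this stand-alone sketch with a stand-in test body:

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
@example(T=128)  # pin the first failing draw seen in this log
@settings(verbosity=Verbosity.verbose, deadline=None, max_examples=10)
def test_sketch(T: int) -> None:
    assert T >= 1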
2025-05-07T20:32:29.4616827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4616930Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4617435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4617670Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4618017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4618110Z kernel = self.compile( 2025-05-07T20:32:29.4618508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4618686Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4618871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4618878Z 2025-05-07T20:32:29.4619086Z self = 2025-05-07T20:32:29.4619889Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4620464Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3dd00>} 2025-05-07T20:32:29.4621226Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4621425Z context = 2025-05-07T20:32:29.4621429Z 2025-05-07T20:32:29.4621596Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4621866Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4621978Z module_map=module_map) 2025-05-07T20:32:29.4622144Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4622245Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4622319Z E ^ 2025-05-07T20:32:29.4622683Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4622687Z 2025-05-07T20:32:29.4623115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4623120Z 2025-05-07T20:32:29.4623222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4623449Z self=, 2025-05-07T20:32:29.4623528Z T=2048, 2025-05-07T20:32:29.4623602Z D=7168, 2025-05-07T20:32:29.4623687Z scale_ub=1200.0, 2025-05-07T20:32:29.4623772Z contiguous=False, 2025-05-07T20:32:29.4623853Z compiled=True, 2025-05-07T20:32:29.4623930Z ) 2025-05-07T20:32:29.4624157Z self = 2025-05-07T20:32:29.4624335Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4624340Z 2025-05-07T20:32:29.4624419Z @given( 2025-05-07T20:32:29.4624540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4624637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4624755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4624871Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4624989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4625063Z ) 2025-05-07T20:32:29.4625315Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4625409Z def test_silu_mul_quant( 2025-05-07T20:32:29.4625484Z self, 2025-05-07T20:32:29.4625557Z T: int, 2025-05-07T20:32:29.4625638Z D: int, 2025-05-07T20:32:29.4625735Z scale_ub: Optional[float], 2025-05-07T20:32:29.4625929Z contiguous: bool, 2025-05-07T20:32:29.4626019Z compiled: bool, 2025-05-07T20:32:29.4626095Z ) -> None: 2025-05-07T20:32:29.4626188Z torch.manual_seed(2025) 2025-05-07T20:32:29.4626262Z 2025-05-07T20:32:29.4626434Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4626508Z 2025-05-07T20:32:29.4626599Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4626722Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4626816Z x = x_sign * x_clamp 2025-05-07T20:32:29.4626935Z x0 = x[:, :D] 2025-05-07T20:32:29.4627013Z x1 = x[:, D:] 2025-05-07T20:32:29.4627091Z 2025-05-07T20:32:29.4627174Z if contiguous: 2025-05-07T20:32:29.4627271Z x0 = x0.contiguous() 2025-05-07T20:32:29.4627358Z x1 = x1.contiguous() 2025-05-07T20:32:29.4627428Z 2025-05-07T20:32:29.4627563Z if scale_ub is not None: 2025-05-07T20:32:29.4627675Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4627813Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4627887Z ) 2025-05-07T20:32:29.4627962Z else: 2025-05-07T20:32:29.4628057Z scale_ub_tensor = None 2025-05-07T20:32:29.4628127Z 2025-05-07T20:32:29.4628257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4628349Z op = silu_mul_quant 2025-05-07T20:32:29.4628430Z if compiled: 2025-05-07T20:32:29.4628528Z op = torch.compile(op) 2025-05-07T20:32:29.4628637Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4628706Z 2025-05-07T20:32:29.4628796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4628804Z 2025-05-07T20:32:29.4628902Z moe/activation_test.py:117: 2025-05-07T20:32:29.4629027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4629132Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4629233Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4629610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4629704Z return fn(*args, **kwargs) 
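[annotation] The compiled=True and compiled=False draws fail identically; the only difference in their tracebacks is the extra torch/_dynamo/eval_frame.py frame, because torch.compile wraps the call before handing it to the very same Triton launch. Dynamo does not change which kernel gets built. A minimal illustration of that wrapping (op is a stand-in, not silu_mul_quant):

import torch

def op(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for the eager op; torch.compile only adds a wrapper frame.
    return torch.nn.functional.silu(x) * x

compiled_op = torch.compile(op)
# eager:    op(x)          -> kernel launch
# compiled: compiled_op(x) -> _dynamo/eval_frame.py:_fn -> op(x) -> same launch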
2025-05-07T20:32:29.4630214Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4630310Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4630681Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4630911Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4631264Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4631356Z kernel = self.compile( 2025-05-07T20:32:29.4631758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4631938Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4632062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4632067Z 2025-05-07T20:32:29.4632274Z self = 2025-05-07T20:32:29.4633068Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4633587Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f9a3fc40>} 2025-05-07T20:32:29.4634433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4634630Z context = 2025-05-07T20:32:29.4634634Z 2025-05-07T20:32:29.4634803Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4635070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4635173Z module_map=module_map) 2025-05-07T20:32:29.4635336Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4635471Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4635547Z E ^ 2025-05-07T20:32:29.4635909Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4635913Z 2025-05-07T20:32:29.4636380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4636385Z 2025-05-07T20:32:29.4636490Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4636718Z self=, 2025-05-07T20:32:29.4636798Z T=1, 2025-05-07T20:32:29.4636872Z D=5120, 2025-05-07T20:32:29.4636953Z scale_ub=None, 2025-05-07T20:32:29.4637039Z contiguous=False, 2025-05-07T20:32:29.4637122Z compiled=False, 2025-05-07T20:32:29.4637192Z ) 2025-05-07T20:32:29.4637414Z self = 2025-05-07T20:32:29.4637583Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.4637587Z 2025-05-07T20:32:29.4637662Z @given( 2025-05-07T20:32:29.4637783Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4637881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4637998Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4638123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4638233Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4638308Z ) 2025-05-07T20:32:29.4638554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4638649Z def test_silu_mul_quant( 2025-05-07T20:32:29.4638725Z self, 2025-05-07T20:32:29.4638799Z T: int, 2025-05-07T20:32:29.4638873Z D: int, 2025-05-07T20:32:29.4638974Z scale_ub: Optional[float], 2025-05-07T20:32:29.4639062Z contiguous: bool, 2025-05-07T20:32:29.4639148Z compiled: bool, 2025-05-07T20:32:29.4639225Z ) -> None: 2025-05-07T20:32:29.4639317Z torch.manual_seed(2025) 2025-05-07T20:32:29.4639387Z 2025-05-07T20:32:29.4639557Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4639627Z 2025-05-07T20:32:29.4639724Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4639849Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4639936Z x = x_sign * x_clamp 2025-05-07T20:32:29.4640018Z x0 = x[:, :D] 2025-05-07T20:32:29.4640096Z x1 = x[:, D:] 2025-05-07T20:32:29.4640238Z 2025-05-07T20:32:29.4640324Z if contiguous: 2025-05-07T20:32:29.4640413Z x0 = x0.contiguous() 2025-05-07T20:32:29.4640499Z x1 = x1.contiguous() 2025-05-07T20:32:29.4640572Z 2025-05-07T20:32:29.4640659Z if scale_ub is not None: 2025-05-07T20:32:29.4640764Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4640901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4640973Z ) 2025-05-07T20:32:29.4641051Z else: 2025-05-07T20:32:29.4641142Z scale_ub_tensor = None 2025-05-07T20:32:29.4641212Z 2025-05-07T20:32:29.4641344Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4641435Z op = silu_mul_quant 2025-05-07T20:32:29.4641603Z if compiled: 2025-05-07T20:32:29.4641706Z op = torch.compile(op) 2025-05-07T20:32:29.4641809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4641878Z 2025-05-07T20:32:29.4641971Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4641975Z 2025-05-07T20:32:29.4642071Z moe/activation_test.py:117: 2025-05-07T20:32:29.4642200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4642299Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4642395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4642946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4643043Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4643407Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4643676Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4644023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4644118Z kernel = self.compile( 2025-05-07T20:32:29.4644509Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4644685Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4644815Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4644821Z 2025-05-07T20:32:29.4645027Z self = 2025-05-07T20:32:29.4645824Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4646345Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f96afc40>} 2025-05-07T20:32:29.4647104Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4647299Z context = 2025-05-07T20:32:29.4647304Z 2025-05-07T20:32:29.4647471Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4647740Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4647846Z module_map=module_map) 2025-05-07T20:32:29.4648007Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4648111Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4648190Z E ^ 2025-05-07T20:32:29.4648550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4648560Z 2025-05-07T20:32:29.4648984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4648989Z 2025-05-07T20:32:29.4649091Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4649317Z self=, 2025-05-07T20:32:29.4649399Z T=4096, 2025-05-07T20:32:29.4649474Z D=7168, 2025-05-07T20:32:29.4649557Z scale_ub=1200.0, 2025-05-07T20:32:29.4649641Z contiguous=False, 2025-05-07T20:32:29.4649722Z compiled=False, 2025-05-07T20:32:29.4649796Z ) 2025-05-07T20:32:29.4650016Z self = 2025-05-07T20:32:29.4650280Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4650285Z 2025-05-07T20:32:29.4650360Z @given( 2025-05-07T20:32:29.4650478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4650578Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4650690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4650805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4650920Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4650991Z ) 2025-05-07T20:32:29.4651240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4651376Z def test_silu_mul_quant( 2025-05-07T20:32:29.4651449Z self, 2025-05-07T20:32:29.4651525Z T: int, 2025-05-07T20:32:29.4651600Z D: int, 2025-05-07T20:32:29.4651696Z scale_ub: Optional[float], 2025-05-07T20:32:29.4651786Z contiguous: bool, 2025-05-07T20:32:29.4651932Z compiled: bool, 2025-05-07T20:32:29.4652012Z ) -> None: 2025-05-07T20:32:29.4652111Z torch.manual_seed(2025) 2025-05-07T20:32:29.4652181Z 2025-05-07T20:32:29.4652352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4652427Z 2025-05-07T20:32:29.4652516Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4652640Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4652730Z x = x_sign * x_clamp 2025-05-07T20:32:29.4652808Z x0 = x[:, :D] 2025-05-07T20:32:29.4652890Z x1 = x[:, D:] 2025-05-07T20:32:29.4652959Z 2025-05-07T20:32:29.4653041Z if contiguous: 2025-05-07T20:32:29.4653134Z x0 = x0.contiguous() 2025-05-07T20:32:29.4653220Z x1 = x1.contiguous() 2025-05-07T20:32:29.4653288Z 2025-05-07T20:32:29.4653378Z if scale_ub is not None: 2025-05-07T20:32:29.4653482Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4653618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4653698Z ) 2025-05-07T20:32:29.4653773Z else: 2025-05-07T20:32:29.4653867Z scale_ub_tensor = None 2025-05-07T20:32:29.4653938Z 2025-05-07T20:32:29.4654066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4654153Z op = silu_mul_quant 2025-05-07T20:32:29.4654237Z if compiled: 2025-05-07T20:32:29.4654334Z op = torch.compile(op) 2025-05-07T20:32:29.4654439Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4654509Z 2025-05-07T20:32:29.4654600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4654604Z 2025-05-07T20:32:29.4654705Z moe/activation_test.py:117: 2025-05-07T20:32:29.4654829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4654929Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4655030Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4655550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.4655649Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4656014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4656240Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4656590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4656682Z kernel = self.compile( 2025-05-07T20:32:29.4657075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4657256Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4657380Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4657386Z 2025-05-07T20:32:29.4657674Z self = 2025-05-07T20:32:29.4658470Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4658986Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8b0ad40>} 2025-05-07T20:32:29.4659757Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4659987Z context = 2025-05-07T20:32:29.4659992Z 2025-05-07T20:32:29.4660199Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4660471Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4660580Z module_map=module_map) 2025-05-07T20:32:29.4660741Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4660839Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4660915Z E ^ 2025-05-07T20:32:29.4661277Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4661282Z 2025-05-07T20:32:29.4661707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4661712Z 2025-05-07T20:32:29.4661815Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4662040Z self=, 2025-05-07T20:32:29.4662123Z T=16384, 2025-05-07T20:32:29.4662199Z D=7168, 2025-05-07T20:32:29.4662284Z scale_ub=None, 2025-05-07T20:32:29.4662372Z contiguous=True, 2025-05-07T20:32:29.4662452Z compiled=True, 2025-05-07T20:32:29.4662522Z ) 2025-05-07T20:32:29.4662745Z self = 2025-05-07T20:32:29.4662920Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.4662925Z 2025-05-07T20:32:29.4662999Z @given( 2025-05-07T20:32:29.4663120Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4663216Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4663336Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4663451Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4663565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4663639Z ) 2025-05-07T20:32:29.4663892Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4663995Z def test_silu_mul_quant( 2025-05-07T20:32:29.4664093Z self, 2025-05-07T20:32:29.4664169Z T: int, 2025-05-07T20:32:29.4664261Z D: int, 2025-05-07T20:32:29.4664363Z scale_ub: Optional[float], 2025-05-07T20:32:29.4664450Z contiguous: bool, 2025-05-07T20:32:29.4664534Z compiled: bool, 2025-05-07T20:32:29.4664612Z ) -> None: 2025-05-07T20:32:29.4664705Z torch.manual_seed(2025) 2025-05-07T20:32:29.4664777Z 2025-05-07T20:32:29.4664944Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4665018Z 2025-05-07T20:32:29.4665109Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4665230Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4665316Z x = x_sign * x_clamp 2025-05-07T20:32:29.4665396Z x0 = x[:, :D] 2025-05-07T20:32:29.4665474Z x1 = x[:, D:] 2025-05-07T20:32:29.4665544Z 2025-05-07T20:32:29.4665628Z if contiguous: 2025-05-07T20:32:29.4665798Z x0 = x0.contiguous() 2025-05-07T20:32:29.4665886Z x1 = x1.contiguous() 2025-05-07T20:32:29.4665958Z 2025-05-07T20:32:29.4666045Z if scale_ub is not None: 2025-05-07T20:32:29.4666151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4666285Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4666359Z ) 2025-05-07T20:32:29.4666435Z else: 2025-05-07T20:32:29.4666527Z scale_ub_tensor = None 2025-05-07T20:32:29.4666598Z 2025-05-07T20:32:29.4666729Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4666857Z op = silu_mul_quant 2025-05-07T20:32:29.4666939Z if compiled: 2025-05-07T20:32:29.4667039Z op = torch.compile(op) 2025-05-07T20:32:29.4667142Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4667211Z 2025-05-07T20:32:29.4667340Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4667344Z 2025-05-07T20:32:29.4667443Z moe/activation_test.py:117: 2025-05-07T20:32:29.4667573Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4667671Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4667767Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4668146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4668237Z return fn(*args, **kwargs) 
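[annotation] For orientation, Triton's fp8 names map onto PyTorch dtypes roughly as follows: 'fp8e4nv' is torch.float8_e4m3fn (4 exponent / 3 mantissa bits, NVIDIA variant), 'fp8e5' is torch.float8_e5m2, and 'fp8e4b15' is a Triton-internal e4m3 format with exponent bias 15 that has no direct torch equivalent. Per the error text, only the latter two compile on this GPU. A hedged helper that picks a compilable fp8 dtype (illustrative; whether FBGEMM's kernels accept e5m2 here is a separate question):

import torch

def pick_fp8_dtype() -> torch.dtype:
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (8, 9):
        return torch.float8_e4m3fn  # Triton 'fp8e4nv'
    return torch.float8_e5m2        # Triton 'fp8e5', per the error above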
2025-05-07T20:32:29.4668744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4668847Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4669212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4669441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4669797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4669889Z kernel = self.compile( 2025-05-07T20:32:29.4670288Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4670463Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4670587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4670594Z 2025-05-07T20:32:29.4670799Z self = 2025-05-07T20:32:29.4671600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4672126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8b0b920>} 2025-05-07T20:32:29.4672893Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4673090Z context = 2025-05-07T20:32:29.4673094Z 2025-05-07T20:32:29.4673257Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4673526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4673636Z module_map=module_map) 2025-05-07T20:32:29.4673796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4673896Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4673972Z E ^ 2025-05-07T20:32:29.4674465Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis then tried the following examples. Every one of them ran the identical test body shown above and failed with the same triton.compiler.errors.CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") raised from /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100; only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4825513Z 2025-05-07T20:32:29.4825947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4825951Z 2025-05-07T20:32:29.4826061Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4826290Z self=, 2025-05-07T20:32:29.4826373Z T=16384, 2025-05-07T20:32:29.4826450Z D=5120, 2025-05-07T20:32:29.4826533Z scale_ub=None, 2025-05-07T20:32:29.4826627Z contiguous=False, 2025-05-07T20:32:29.4826711Z compiled=True, 2025-05-07T20:32:29.4826785Z ) 2025-05-07T20:32:29.4827012Z self = 2025-05-07T20:32:29.4827192Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4827197Z 2025-05-07T20:32:29.4827278Z @given( 2025-05-07T20:32:29.4827400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4827584Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4827706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4827824Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4827938Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4828017Z ) 2025-05-07T20:32:29.4828272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4828365Z def test_silu_mul_quant( 2025-05-07T20:32:29.4828445Z self, 2025-05-07T20:32:29.4828522Z T: int, 2025-05-07T20:32:29.4828601Z D: int, 2025-05-07T20:32:29.4828741Z scale_ub: Optional[float], 2025-05-07T20:32:29.4828831Z contiguous: bool, 2025-05-07T20:32:29.4828919Z compiled: bool, 2025-05-07T20:32:29.4828998Z ) -> None: 2025-05-07T20:32:29.4829094Z torch.manual_seed(2025) 2025-05-07T20:32:29.4829170Z 2025-05-07T20:32:29.4829382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4829461Z 2025-05-07T20:32:29.4829559Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4829686Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4829777Z x = x_sign * x_clamp 2025-05-07T20:32:29.4829861Z x0 = x[:, :D] 2025-05-07T20:32:29.4829940Z x1 = x[:, D:] 2025-05-07T20:32:29.4830014Z 2025-05-07T20:32:29.4830099Z if contiguous: 2025-05-07T20:32:29.4830190Z x0 = x0.contiguous() 2025-05-07T20:32:29.4830284Z x1 = x1.contiguous() 2025-05-07T20:32:29.4830355Z 2025-05-07T20:32:29.4830450Z if scale_ub is not None: 2025-05-07T20:32:29.4830559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4830695Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4830770Z ) 2025-05-07T20:32:29.4830851Z else: 2025-05-07T20:32:29.4830945Z scale_ub_tensor = None 2025-05-07T20:32:29.4831021Z 2025-05-07T20:32:29.4831164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4831257Z op = silu_mul_quant 2025-05-07T20:32:29.4831341Z if compiled: 2025-05-07T20:32:29.4831445Z op = torch.compile(op) 2025-05-07T20:32:29.4831553Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4831628Z 2025-05-07T20:32:29.4831720Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4831724Z 2025-05-07T20:32:29.4831825Z moe/activation_test.py:117: 2025-05-07T20:32:29.4831959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4832064Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4832163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4832544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4832638Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4833159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4833259Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4833626Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4833858Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4834255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4834360Z kernel = self.compile( 2025-05-07T20:32:29.4834758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4834939Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4835071Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4835077Z 2025-05-07T20:32:29.4835365Z self = 2025-05-07T20:32:29.4836168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4836690Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f8fdb060>} 2025-05-07T20:32:29.4837459Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4837699Z context = 2025-05-07T20:32:29.4837703Z 2025-05-07T20:32:29.4837873Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4838191Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4838301Z module_map=module_map) 2025-05-07T20:32:29.4838465Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4838567Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4838644Z E ^ 2025-05-07T20:32:29.4839009Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4839013Z 2025-05-07T20:32:29.4839450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4839456Z 2025-05-07T20:32:29.4839566Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4839799Z self=, 2025-05-07T20:32:29.4839878Z T=2048, 2025-05-07T20:32:29.4839958Z D=5120, 2025-05-07T20:32:29.4840049Z scale_ub=None, 2025-05-07T20:32:29.4840220Z contiguous=False, 2025-05-07T20:32:29.4840304Z compiled=True, 2025-05-07T20:32:29.4840380Z ) 2025-05-07T20:32:29.4840604Z self = 2025-05-07T20:32:29.4840782Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:29.4840790Z 2025-05-07T20:32:29.4840868Z @given( 2025-05-07T20:32:29.4840990Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4841094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4841213Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4841334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4841453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4841527Z ) 2025-05-07T20:32:29.4841780Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4841881Z def test_silu_mul_quant( 2025-05-07T20:32:29.4841963Z self, 2025-05-07T20:32:29.4842040Z T: int, 2025-05-07T20:32:29.4842118Z D: int, 2025-05-07T20:32:29.4842217Z scale_ub: Optional[float], 2025-05-07T20:32:29.4842312Z contiguous: bool, 2025-05-07T20:32:29.4842399Z compiled: bool, 2025-05-07T20:32:29.4842476Z ) -> None: 2025-05-07T20:32:29.4842575Z torch.manual_seed(2025) 2025-05-07T20:32:29.4842647Z 2025-05-07T20:32:29.4842820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4842899Z 2025-05-07T20:32:29.4842991Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4843119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4843210Z x = x_sign * x_clamp 2025-05-07T20:32:29.4843291Z x0 = x[:, :D] 2025-05-07T20:32:29.4843370Z x1 = x[:, D:] 2025-05-07T20:32:29.4843445Z 2025-05-07T20:32:29.4843532Z if contiguous: 2025-05-07T20:32:29.4843626Z x0 = x0.contiguous() 2025-05-07T20:32:29.4843827Z x1 = x1.contiguous() 2025-05-07T20:32:29.4843899Z 2025-05-07T20:32:29.4843994Z if scale_ub is not None: 2025-05-07T20:32:29.4844102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4844238Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4844318Z ) 2025-05-07T20:32:29.4844394Z else: 2025-05-07T20:32:29.4844490Z scale_ub_tensor = None 2025-05-07T20:32:29.4844566Z 2025-05-07T20:32:29.4844699Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4844829Z op = silu_mul_quant 2025-05-07T20:32:29.4844918Z if compiled: 2025-05-07T20:32:29.4845018Z op = torch.compile(op) 2025-05-07T20:32:29.4845125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4845200Z 2025-05-07T20:32:29.4845290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4845333Z 2025-05-07T20:32:29.4845434Z moe/activation_test.py:117: 2025-05-07T20:32:29.4845568Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4845669Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4845771Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4846151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4846247Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4846760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4846861Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4847230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4847462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4847823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4847921Z kernel = self.compile( 2025-05-07T20:32:29.4848316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4848501Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4848632Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4848637Z 2025-05-07T20:32:29.4848848Z self = 2025-05-07T20:32:29.4849653Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4850178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1c7c0>} 2025-05-07T20:32:29.4850949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4851146Z context = 2025-05-07T20:32:29.4851151Z 2025-05-07T20:32:29.4851320Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4851597Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4851708Z module_map=module_map) 2025-05-07T20:32:29.4851874Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4851974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4852050Z E ^ 2025-05-07T20:32:29.4852499Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4852507Z 2025-05-07T20:32:29.4852936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4852941Z 2025-05-07T20:32:29.4853049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4853278Z self=, 2025-05-07T20:32:29.4853355Z T=2048, 2025-05-07T20:32:29.4853434Z D=5120, 2025-05-07T20:32:29.4853518Z scale_ub=1200.0, 2025-05-07T20:32:29.4853605Z contiguous=False, 2025-05-07T20:32:29.4853731Z compiled=True, 2025-05-07T20:32:29.4853804Z ) 2025-05-07T20:32:29.4854026Z self = 2025-05-07T20:32:29.4854211Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4854216Z 2025-05-07T20:32:29.4854337Z @given( 2025-05-07T20:32:29.4854467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4854568Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4854685Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4854806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4854921Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4854996Z ) 2025-05-07T20:32:29.4855253Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4855349Z def test_silu_mul_quant( 2025-05-07T20:32:29.4855426Z self, 2025-05-07T20:32:29.4855510Z T: int, 2025-05-07T20:32:29.4855586Z D: int, 2025-05-07T20:32:29.4855684Z scale_ub: Optional[float], 2025-05-07T20:32:29.4855778Z contiguous: bool, 2025-05-07T20:32:29.4855864Z compiled: bool, 2025-05-07T20:32:29.4855947Z ) -> None: 2025-05-07T20:32:29.4856043Z torch.manual_seed(2025) 2025-05-07T20:32:29.4856121Z 2025-05-07T20:32:29.4856301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4856376Z 2025-05-07T20:32:29.4856467Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4856594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4856683Z x = x_sign * x_clamp 2025-05-07T20:32:29.4856763Z x0 = x[:, :D] 2025-05-07T20:32:29.4856845Z x1 = x[:, D:] 2025-05-07T20:32:29.4856917Z 2025-05-07T20:32:29.4857001Z if contiguous: 2025-05-07T20:32:29.4857095Z x0 = x0.contiguous() 2025-05-07T20:32:29.4857186Z x1 = x1.contiguous() 2025-05-07T20:32:29.4857260Z 2025-05-07T20:32:29.4857354Z if scale_ub is not None: 2025-05-07T20:32:29.4857461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4857604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4857680Z ) 2025-05-07T20:32:29.4857760Z else: 2025-05-07T20:32:29.4857857Z scale_ub_tensor = None 2025-05-07T20:32:29.4857935Z 2025-05-07T20:32:29.4858067Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4858162Z op = silu_mul_quant 2025-05-07T20:32:29.4858246Z if compiled: 2025-05-07T20:32:29.4858345Z op = torch.compile(op) 2025-05-07T20:32:29.4858456Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4858529Z 2025-05-07T20:32:29.4858621Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4858628Z 2025-05-07T20:32:29.4858726Z moe/activation_test.py:117: 2025-05-07T20:32:29.4858855Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4858961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4859061Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4859440Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4859540Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4860135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4860236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4860610Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4860840Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4861195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4861375Z kernel = self.compile( 2025-05-07T20:32:29.4861772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4861956Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4862085Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4862126Z 2025-05-07T20:32:29.4862343Z self = 2025-05-07T20:32:29.4863145Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4863670Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1d580>} 2025-05-07T20:32:29.4864498Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4864694Z context = 2025-05-07T20:32:29.4864701Z 2025-05-07T20:32:29.4864877Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4865152Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4865261Z module_map=module_map) 2025-05-07T20:32:29.4865429Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4865529Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4865610Z E ^ 2025-05-07T20:32:29.4865978Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4865985Z 2025-05-07T20:32:29.4866420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4866424Z 2025-05-07T20:32:29.4866531Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4866764Z self=, 2025-05-07T20:32:29.4866850Z T=4096, 2025-05-07T20:32:29.4866931Z D=5120, 2025-05-07T20:32:29.4867015Z scale_ub=1200.0, 2025-05-07T20:32:29.4867104Z contiguous=True, 2025-05-07T20:32:29.4867188Z compiled=True, 2025-05-07T20:32:29.4867260Z ) 2025-05-07T20:32:29.4867486Z self = 2025-05-07T20:32:29.4867664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4867669Z 2025-05-07T20:32:29.4867749Z @given( 2025-05-07T20:32:29.4867871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4867971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4868092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4868209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4868324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4868401Z ) 2025-05-07T20:32:29.4868656Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4868829Z def test_silu_mul_quant( 2025-05-07T20:32:29.4868910Z self, 2025-05-07T20:32:29.4868987Z T: int, 2025-05-07T20:32:29.4869065Z D: int, 2025-05-07T20:32:29.4869164Z scale_ub: Optional[float], 2025-05-07T20:32:29.4869253Z contiguous: bool, 2025-05-07T20:32:29.4869340Z compiled: bool, 2025-05-07T20:32:29.4869418Z ) -> None: 2025-05-07T20:32:29.4869513Z torch.manual_seed(2025) 2025-05-07T20:32:29.4869587Z 2025-05-07T20:32:29.4869763Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4869878Z 2025-05-07T20:32:29.4869973Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4870099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4870188Z x = x_sign * x_clamp 2025-05-07T20:32:29.4870271Z x0 = x[:, :D] 2025-05-07T20:32:29.4870352Z x1 = x[:, D:] 2025-05-07T20:32:29.4870462Z 2025-05-07T20:32:29.4870551Z if contiguous: 2025-05-07T20:32:29.4870647Z x0 = x0.contiguous() 2025-05-07T20:32:29.4870742Z x1 = x1.contiguous() 2025-05-07T20:32:29.4870814Z 2025-05-07T20:32:29.4870906Z if scale_ub is not None: 2025-05-07T20:32:29.4871014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4871152Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4871228Z ) 2025-05-07T20:32:29.4871307Z else: 2025-05-07T20:32:29.4871401Z scale_ub_tensor = None 2025-05-07T20:32:29.4871473Z 2025-05-07T20:32:29.4871611Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4871701Z op = silu_mul_quant 2025-05-07T20:32:29.4871786Z if compiled: 2025-05-07T20:32:29.4871889Z op = torch.compile(op) 2025-05-07T20:32:29.4871996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4872072Z 2025-05-07T20:32:29.4872162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4872167Z 2025-05-07T20:32:29.4872270Z moe/activation_test.py:117: 2025-05-07T20:32:29.4872402Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4872503Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4872604Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4872992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4873088Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4873608Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4873712Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4874138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4874368Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4874729Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4874826Z kernel = self.compile( 2025-05-07T20:32:29.4875222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4875404Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4875534Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4875538Z 2025-05-07T20:32:29.4875748Z self = 2025-05-07T20:32:29.4876557Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4877223Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1e840>} 2025-05-07T20:32:29.4877998Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4878194Z context = 2025-05-07T20:32:29.4878199Z 2025-05-07T20:32:29.4878370Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4878681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4878790Z module_map=module_map) 2025-05-07T20:32:29.4878956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4879055Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4879169Z E ^ 2025-05-07T20:32:29.4879546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4879550Z 2025-05-07T20:32:29.4879978Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4879983Z 2025-05-07T20:32:29.4880089Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4880369Z self=, 2025-05-07T20:32:29.4880446Z T=128, 2025-05-07T20:32:29.4880526Z D=5120, 2025-05-07T20:32:29.4880611Z scale_ub=1200.0, 2025-05-07T20:32:29.4880698Z contiguous=False, 2025-05-07T20:32:29.4880783Z compiled=True, 2025-05-07T20:32:29.4880856Z ) 2025-05-07T20:32:29.4881078Z self = 2025-05-07T20:32:29.4881258Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4881266Z 2025-05-07T20:32:29.4881344Z @given( 2025-05-07T20:32:29.4881474Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4881574Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4881690Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4881811Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4881926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4882000Z ) 2025-05-07T20:32:29.4882262Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4882355Z def test_silu_mul_quant( 2025-05-07T20:32:29.4882433Z self, 2025-05-07T20:32:29.4882512Z T: int, 2025-05-07T20:32:29.4882588Z D: int, 2025-05-07T20:32:29.4882690Z scale_ub: Optional[float], 2025-05-07T20:32:29.4882779Z contiguous: bool, 2025-05-07T20:32:29.4882864Z compiled: bool, 2025-05-07T20:32:29.4882949Z ) -> None: 2025-05-07T20:32:29.4883045Z torch.manual_seed(2025) 2025-05-07T20:32:29.4883122Z 2025-05-07T20:32:29.4883301Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4883378Z 2025-05-07T20:32:29.4883470Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4883598Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4883687Z x = x_sign * x_clamp 2025-05-07T20:32:29.4883767Z x0 = x[:, :D] 2025-05-07T20:32:29.4883851Z x1 = x[:, D:] 2025-05-07T20:32:29.4883923Z 2025-05-07T20:32:29.4884007Z if contiguous: 2025-05-07T20:32:29.4884100Z x0 = x0.contiguous() 2025-05-07T20:32:29.4884192Z x1 = x1.contiguous() 2025-05-07T20:32:29.4884270Z 2025-05-07T20:32:29.4884363Z if scale_ub is not None: 2025-05-07T20:32:29.4884469Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4884610Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4884687Z ) 2025-05-07T20:32:29.4884764Z else: 2025-05-07T20:32:29.4884947Z scale_ub_tensor = None 2025-05-07T20:32:29.4885020Z 2025-05-07T20:32:29.4885154Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4885248Z op = silu_mul_quant 2025-05-07T20:32:29.4885335Z if compiled: 2025-05-07T20:32:29.4885435Z op = torch.compile(op) 2025-05-07T20:32:29.4885549Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4885619Z 2025-05-07T20:32:29.4885712Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4885716Z 2025-05-07T20:32:29.4885814Z moe/activation_test.py:117: 2025-05-07T20:32:29.4885983Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4886086Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4886186Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4886565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4886709Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4887221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4887322Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4887691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4887925Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4888281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4888378Z kernel = self.compile( 2025-05-07T20:32:29.4888772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4888954Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4889089Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4889093Z 2025-05-07T20:32:29.4889304Z self = 2025-05-07T20:32:29.4890104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4890628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7d1f4c0>} 2025-05-07T20:32:29.4891401Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4891596Z context = 2025-05-07T20:32:29.4891606Z 2025-05-07T20:32:29.4891785Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4892056Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4892167Z module_map=module_map) 2025-05-07T20:32:29.4892331Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4892430Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4892510Z E ^ 2025-05-07T20:32:29.4892878Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4892885Z 2025-05-07T20:32:29.4893313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4893321Z 2025-05-07T20:32:29.4893427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4893657Z self=, 2025-05-07T20:32:29.4893814Z T=16384, 2025-05-07T20:32:29.4893893Z D=7168, 2025-05-07T20:32:29.4893977Z scale_ub=1200.0, 2025-05-07T20:32:29.4894065Z contiguous=True, 2025-05-07T20:32:29.4894147Z compiled=True, 2025-05-07T20:32:29.4894219Z ) 2025-05-07T20:32:29.4894446Z self = 2025-05-07T20:32:29.4894627Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.4894631Z 2025-05-07T20:32:29.4894715Z @given( 2025-05-07T20:32:29.4894839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4894980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4895100Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4895218Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4895332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4895448Z ) 2025-05-07T20:32:29.4895707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4895801Z def test_silu_mul_quant( 2025-05-07T20:32:29.4895881Z self, 2025-05-07T20:32:29.4895959Z T: int, 2025-05-07T20:32:29.4896034Z D: int, 2025-05-07T20:32:29.4896136Z scale_ub: Optional[float], 2025-05-07T20:32:29.4896226Z contiguous: bool, 2025-05-07T20:32:29.4896318Z compiled: bool, 2025-05-07T20:32:29.4896397Z ) -> None: 2025-05-07T20:32:29.4896492Z torch.manual_seed(2025) 2025-05-07T20:32:29.4896568Z 2025-05-07T20:32:29.4896746Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4896819Z 2025-05-07T20:32:29.4896913Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4897040Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4897128Z x = x_sign * x_clamp 2025-05-07T20:32:29.4897215Z x0 = x[:, :D] 2025-05-07T20:32:29.4897294Z x1 = x[:, D:] 2025-05-07T20:32:29.4897370Z 2025-05-07T20:32:29.4897457Z if contiguous: 2025-05-07T20:32:29.4897549Z x0 = x0.contiguous() 2025-05-07T20:32:29.4897638Z x1 = x1.contiguous() 2025-05-07T20:32:29.4897713Z 2025-05-07T20:32:29.4897804Z if scale_ub is not None: 2025-05-07T20:32:29.4897914Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4898052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4898131Z ) 2025-05-07T20:32:29.4898210Z else: 2025-05-07T20:32:29.4898304Z scale_ub_tensor = None 2025-05-07T20:32:29.4898378Z 2025-05-07T20:32:29.4898515Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4898605Z op = silu_mul_quant 2025-05-07T20:32:29.4898690Z if compiled: 2025-05-07T20:32:29.4898794Z op = torch.compile(op) 2025-05-07T20:32:29.4898903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4898979Z 2025-05-07T20:32:29.4899079Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4899083Z 2025-05-07T20:32:29.4899181Z moe/activation_test.py:117: 2025-05-07T20:32:29.4899316Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4899418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4899519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4899950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4900045Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4900560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4900663Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4901033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4901351Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4901707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4901805Z kernel = self.compile( 2025-05-07T20:32:29.4902204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4902386Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4902518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4902560Z 2025-05-07T20:32:29.4902774Z self = 2025-05-07T20:32:29.4903577Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4904172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e34c20>} 2025-05-07T20:32:29.4904948Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4905172Z context = 2025-05-07T20:32:29.4905177Z 2025-05-07T20:32:29.4905371Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4905646Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4905757Z module_map=module_map) 2025-05-07T20:32:29.4905922Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4906030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4906112Z E ^ 2025-05-07T20:32:29.4906478Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4906482Z 2025-05-07T20:32:29.4906916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4906920Z 2025-05-07T20:32:29.4907026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4907261Z self=, 2025-05-07T20:32:29.4907340Z T=16384, 2025-05-07T20:32:29.4907420Z D=5120, 2025-05-07T20:32:29.4907507Z scale_ub=1200.0, 2025-05-07T20:32:29.4907591Z contiguous=True, 2025-05-07T20:32:29.4907676Z compiled=False, 2025-05-07T20:32:29.4907751Z ) 2025-05-07T20:32:29.4907975Z self = 2025-05-07T20:32:29.4908161Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.4908169Z 2025-05-07T20:32:29.4908249Z @given( 2025-05-07T20:32:29.4908371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4908476Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4908593Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4908711Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4908830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4908904Z ) 2025-05-07T20:32:29.4909156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4909259Z def test_silu_mul_quant( 2025-05-07T20:32:29.4909336Z self, 2025-05-07T20:32:29.4909411Z T: int, 2025-05-07T20:32:29.4909490Z D: int, 2025-05-07T20:32:29.4909589Z scale_ub: Optional[float], 2025-05-07T20:32:29.4909679Z contiguous: bool, 2025-05-07T20:32:29.4909769Z compiled: bool, 2025-05-07T20:32:29.4909847Z ) -> None: 2025-05-07T20:32:29.4910031Z torch.manual_seed(2025) 2025-05-07T20:32:29.4910105Z 2025-05-07T20:32:29.4910281Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4910359Z 2025-05-07T20:32:29.4910452Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4910577Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4910669Z x = x_sign * x_clamp 2025-05-07T20:32:29.4910748Z x0 = x[:, :D] 2025-05-07T20:32:29.4910827Z x1 = x[:, D:] 2025-05-07T20:32:29.4910903Z 2025-05-07T20:32:29.4911027Z if contiguous: 2025-05-07T20:32:29.4911120Z x0 = x0.contiguous() 2025-05-07T20:32:29.4911216Z x1 = x1.contiguous() 2025-05-07T20:32:29.4911287Z 2025-05-07T20:32:29.4911377Z if scale_ub is not None: 2025-05-07T20:32:29.4911488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4911665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4911748Z ) 2025-05-07T20:32:29.4911824Z else: 2025-05-07T20:32:29.4911919Z scale_ub_tensor = None 2025-05-07T20:32:29.4911994Z 2025-05-07T20:32:29.4912127Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4912219Z op = silu_mul_quant 2025-05-07T20:32:29.4912306Z if compiled: 2025-05-07T20:32:29.4912406Z op = torch.compile(op) 2025-05-07T20:32:29.4912512Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4912586Z 2025-05-07T20:32:29.4912676Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4912682Z 2025-05-07T20:32:29.4912783Z moe/activation_test.py:117: 2025-05-07T20:32:29.4912913Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4913014Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4913117Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4914067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:29.4914171Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4914549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4914776Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4915132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4915227Z kernel = self.compile( 2025-05-07T20:32:29.4915622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4915809Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4915940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4915947Z 2025-05-07T20:32:29.4916160Z self = 2025-05-07T20:32:29.4916961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4917479Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e35580>} 2025-05-07T20:32:29.4918246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4918441Z context = 2025-05-07T20:32:29.4918446Z 2025-05-07T20:32:29.4918617Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4919029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4919139Z module_map=module_map) 2025-05-07T20:32:29.4919303Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4919401Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4919477Z E ^ 2025-05-07T20:32:29.4919844Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4919848Z 2025-05-07T20:32:29.4920340Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4920401Z 2025-05-07T20:32:29.4920508Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4920734Z self=, 2025-05-07T20:32:29.4920866Z T=1, 2025-05-07T20:32:29.4920943Z D=7168, 2025-05-07T20:32:29.4921033Z scale_ub=1200.0, 2025-05-07T20:32:29.4923777Z contiguous=False, 2025-05-07T20:32:29.4923886Z compiled=False, 2025-05-07T20:32:29.4923961Z ) 2025-05-07T20:32:29.4924190Z self = 2025-05-07T20:32:29.4924369Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.4924374Z 2025-05-07T20:32:29.4924453Z @given( 2025-05-07T20:32:29.4924574Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4924678Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4924797Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4924922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4925038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4925114Z ) 2025-05-07T20:32:29.4925371Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4925469Z def test_silu_mul_quant( 2025-05-07T20:32:29.4925549Z self, 2025-05-07T20:32:29.4925650Z T: int, 2025-05-07T20:32:29.4925728Z D: int, 2025-05-07T20:32:29.4925827Z scale_ub: Optional[float], 2025-05-07T20:32:29.4925919Z contiguous: bool, 2025-05-07T20:32:29.4926006Z compiled: bool, 2025-05-07T20:32:29.4926085Z ) -> None: 2025-05-07T20:32:29.4926184Z torch.manual_seed(2025) 2025-05-07T20:32:29.4926256Z 2025-05-07T20:32:29.4926428Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4926506Z 2025-05-07T20:32:29.4926603Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4926729Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4926821Z x = x_sign * x_clamp 2025-05-07T20:32:29.4926904Z x0 = x[:, :D] 2025-05-07T20:32:29.4926987Z x1 = x[:, D:] 2025-05-07T20:32:29.4927058Z 2025-05-07T20:32:29.4927146Z if contiguous: 2025-05-07T20:32:29.4927240Z x0 = x0.contiguous() 2025-05-07T20:32:29.4927333Z x1 = x1.contiguous() 2025-05-07T20:32:29.4927404Z 2025-05-07T20:32:29.4927501Z if scale_ub is not None: 2025-05-07T20:32:29.4927610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4927747Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4927827Z ) 2025-05-07T20:32:29.4927902Z else: 2025-05-07T20:32:29.4927997Z scale_ub_tensor = None 2025-05-07T20:32:29.4928074Z 2025-05-07T20:32:29.4928207Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4928305Z op = silu_mul_quant 2025-05-07T20:32:29.4928392Z if compiled: 2025-05-07T20:32:29.4928491Z op = torch.compile(op) 2025-05-07T20:32:29.4928602Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4928674Z 2025-05-07T20:32:29.4928764Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4928770Z 2025-05-07T20:32:29.4928931Z moe/activation_test.py:117: 2025-05-07T20:32:29.4929066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4929168Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4929272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4929786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4929889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4930259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4930530Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4930886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4930981Z kernel = self.compile( 2025-05-07T20:32:29.4931420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4931684Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4931818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4931822Z 2025-05-07T20:32:29.4932036Z self = 2025-05-07T20:32:29.4932840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4933367Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e368e0>} 2025-05-07T20:32:29.4934143Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4934342Z context = 2025-05-07T20:32:29.4934347Z 2025-05-07T20:32:29.4934517Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4934789Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4934900Z module_map=module_map) 2025-05-07T20:32:29.4935065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4935168Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4935253Z E ^ 2025-05-07T20:32:29.4935619Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4935623Z 2025-05-07T20:32:29.4936056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4936067Z 2025-05-07T20:32:29.4936177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4936408Z self=, 2025-05-07T20:32:29.4936489Z T=4096, 2025-05-07T20:32:29.4936568Z D=7168, 2025-05-07T20:32:29.4936652Z scale_ub=1200.0, 2025-05-07T20:32:29.4936741Z contiguous=False, 2025-05-07T20:32:29.4936825Z compiled=True, 2025-05-07T20:32:29.4936900Z ) 2025-05-07T20:32:29.4937129Z self = 2025-05-07T20:32:29.4937312Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4937317Z 2025-05-07T20:32:29.4937398Z @given( 2025-05-07T20:32:29.4937519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4937618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4937739Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4937903Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4938022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4938101Z ) 2025-05-07T20:32:29.4938353Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4938448Z def test_silu_mul_quant( 2025-05-07T20:32:29.4938528Z self, 2025-05-07T20:32:29.4938605Z T: int, 2025-05-07T20:32:29.4938682Z D: int, 2025-05-07T20:32:29.4938784Z scale_ub: Optional[float], 2025-05-07T20:32:29.4938874Z contiguous: bool, 2025-05-07T20:32:29.4939029Z compiled: bool, 2025-05-07T20:32:29.4939108Z ) -> None: 2025-05-07T20:32:29.4939204Z torch.manual_seed(2025) 2025-05-07T20:32:29.4942624Z 2025-05-07T20:32:29.4942820Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4942900Z 2025-05-07T20:32:29.4943070Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4943208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4943354Z x = x_sign * x_clamp 2025-05-07T20:32:29.4943441Z x0 = x[:, :D] 2025-05-07T20:32:29.4943527Z x1 = x[:, D:] 2025-05-07T20:32:29.4943601Z 2025-05-07T20:32:29.4943686Z if contiguous: 2025-05-07T20:32:29.4943780Z x0 = x0.contiguous() 2025-05-07T20:32:29.4943869Z x1 = x1.contiguous() 2025-05-07T20:32:29.4943941Z 2025-05-07T20:32:29.4944040Z if scale_ub is not None: 2025-05-07T20:32:29.4944149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4944292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4944371Z ) 2025-05-07T20:32:29.4944447Z else: 2025-05-07T20:32:29.4944541Z scale_ub_tensor = None 2025-05-07T20:32:29.4944616Z 2025-05-07T20:32:29.4944751Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4944849Z op = silu_mul_quant 2025-05-07T20:32:29.4944937Z if compiled: 2025-05-07T20:32:29.4945040Z op = torch.compile(op) 2025-05-07T20:32:29.4945151Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4945224Z 2025-05-07T20:32:29.4945319Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4945324Z 2025-05-07T20:32:29.4945429Z moe/activation_test.py:117: 2025-05-07T20:32:29.4945562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4945664Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4945768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4946156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4946255Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4946764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4946869Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4947246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4947474Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4947830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4947925Z kernel = self.compile( 2025-05-07T20:32:29.4948318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4948505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4948634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4948639Z 2025-05-07T20:32:29.4948848Z self = 2025-05-07T20:32:29.4949705Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4950228Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7e37a60>} 2025-05-07T20:32:29.4951002Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4951238Z context = 2025-05-07T20:32:29.4951243Z 2025-05-07T20:32:29.4951415Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4951689Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4951838Z module_map=module_map) 2025-05-07T20:32:29.4952050Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4952156Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4952233Z E ^ 2025-05-07T20:32:29.4952603Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.4952607Z 2025-05-07T20:32:29.4953033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.4953039Z 2025-05-07T20:32:29.4953146Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.4953375Z self=, 2025-05-07T20:32:29.4953453Z T=128, 2025-05-07T20:32:29.4953534Z D=7168, 2025-05-07T20:32:29.4953619Z scale_ub=1200.0, 2025-05-07T20:32:29.4953711Z contiguous=False, 2025-05-07T20:32:29.4953806Z compiled=True, 2025-05-07T20:32:29.4953899Z ) 2025-05-07T20:32:29.4954150Z self = 2025-05-07T20:32:29.4954330Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:29.4954335Z 2025-05-07T20:32:29.4954411Z @given( 2025-05-07T20:32:29.4954536Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.4954638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.4954755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.4954879Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.4954997Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.4955072Z ) 2025-05-07T20:32:29.4955331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.4955425Z def test_silu_mul_quant( 2025-05-07T20:32:29.4955507Z self, 2025-05-07T20:32:29.4955591Z T: int, 2025-05-07T20:32:29.4955666Z D: int, 2025-05-07T20:32:29.4955777Z scale_ub: Optional[float], 2025-05-07T20:32:29.4955867Z contiguous: bool, 2025-05-07T20:32:29.4955953Z compiled: bool, 2025-05-07T20:32:29.4956035Z ) -> None: 2025-05-07T20:32:29.4956129Z torch.manual_seed(2025) 2025-05-07T20:32:29.4956201Z 2025-05-07T20:32:29.4956380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.4956459Z 2025-05-07T20:32:29.4956552Z x_sign = torch.sign(x) 2025-05-07T20:32:29.4956681Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.4956773Z x = x_sign * x_clamp 2025-05-07T20:32:29.4956856Z x0 = x[:, :D] 2025-05-07T20:32:29.4956938Z x1 = x[:, D:] 2025-05-07T20:32:29.4957011Z 2025-05-07T20:32:29.4957098Z if contiguous: 2025-05-07T20:32:29.4957190Z x0 = x0.contiguous() 2025-05-07T20:32:29.4957280Z x1 = x1.contiguous() 2025-05-07T20:32:29.4957359Z 2025-05-07T20:32:29.4957498Z if scale_ub is not None: 2025-05-07T20:32:29.4957610Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.4957755Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.4957830Z ) 2025-05-07T20:32:29.4957905Z else: 2025-05-07T20:32:29.4958005Z scale_ub_tensor = None 2025-05-07T20:32:29.4958077Z 2025-05-07T20:32:29.4958209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.4958303Z op = silu_mul_quant 2025-05-07T20:32:29.4958392Z if compiled: 2025-05-07T20:32:29.4958537Z op = torch.compile(op) 2025-05-07T20:32:29.4958643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4958715Z 2025-05-07T20:32:29.4958810Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.4958815Z 2025-05-07T20:32:29.4958913Z moe/activation_test.py:117: 2025-05-07T20:32:29.4959082Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4959189Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.4959333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.4959714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.4959813Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.4960424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.4960532Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.4960901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.4961134Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.4961490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.4961592Z kernel = self.compile( 2025-05-07T20:32:29.4961995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.4962174Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.4962303Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.4962308Z 2025-05-07T20:32:29.4962521Z self = 2025-05-07T20:32:29.4963325Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.4963879Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f31f817cea0>} 2025-05-07T20:32:29.4964680Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.4964877Z context = 2025-05-07T20:32:29.4964881Z 2025-05-07T20:32:29.4965053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.4965325Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.4965437Z module_map=module_map) 2025-05-07T20:32:29.4965605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.4965705Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.4965785Z E ^ 2025-05-07T20:32:29.4966150Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
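Note on the repeated CompilationError above: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or newer; on older architectures it exposes only the fp8e4b15 and fp8e5 variants named in the ValueError, so the kernel fails to compile regardless of input shape. A minimal guard sketch using only public torch APIs (fp8e4nv_supported is a hypothetical helper name, not an FBGEMM or test-suite function):

    import torch

    def fp8e4nv_supported() -> bool:
        # Best-effort check: Triton accepts fp8e4nv (E4M3) only on SM 8.9+
        # (Ada/Hopper-class) devices; earlier GPUs raise the ValueError above.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

Tests that exercise the fp8 path could then skip themselves on unsupported devices instead of failing inside the compiler.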
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
[test source identical to the first full example above]
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
[test source identical to the first full example above]
> x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x_sign = torch.sign(x)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:94: OutOfMemoryError
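The allocation sizes in these OutOfMemoryError messages match the test's [T, 2 * D] bfloat16 input exactly (for example, 16384 * 2 * 7168 * 2 bytes = 448.00 MiB), and the amount of free memory shrinks from one Hypothesis example to the next, which suggests tensors or cached allocator blocks from earlier examples are still resident. A sketch of the two conventional mitigations, under the assumption that they are applied inside the test process (release_cuda_memory is a hypothetical helper, not part of activation_test.py):

    import gc
    import os

    # The setting the allocator message itself suggests; it must be set before
    # the first CUDA allocation in the process to take effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # Drop dangling Python references, then return cached blocks to the
        # driver so the next Hypothesis example starts from a cleaner state.
        gc.collect()
        torch.cuda.empty_cache()

Calling release_cuda_memory() between examples (for instance, from the test's setUp) trades some re-allocation cost for headroom on a 22 GiB device.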
Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
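This example ran with compiled=False and still failed in Triton's make_ir: _fbgemm_silu_mul_quant is a Triton JIT kernel, so it is compiled at its first launch whether or not the caller is wrapped in torch.compile, and the failure is a property of the GPU rather than of Dynamo. A sketch of skipping at the test level, reusing the hypothetical fp8e4nv_supported() helper from the earlier note (the class name here is illustrative, not the real test class):

    import unittest

    @unittest.skipIf(not fp8e4nv_supported(), "Triton fp8e4nv requires SM 8.9+")
    class ActivationFp8Tests(unittest.TestCase):
        # fp8-dependent tests such as test_silu_mul_quant would live here.
        ...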
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
[test source and Triton traceback identical to the first full example above]
> y_fp8, y_scale = fn()
moe/activation_test.py:117:
E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
[test source identical to the first full example above]
> x_sign = torch.sign(x)
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. [allocator hint and documentation link as above]
moe/activation_test.py:94: OutOfMemoryError
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5073928Z 2025-05-07T20:32:29.5074049Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:29.5074054Z 2025-05-07T20:32:29.5074161Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5074395Z self=, 2025-05-07T20:32:29.5074477Z T=16384, 2025-05-07T20:32:29.5074557Z D=5120, 2025-05-07T20:32:29.5074640Z scale_ub=None, 2025-05-07T20:32:29.5074726Z contiguous=True, 2025-05-07T20:32:29.5074814Z compiled=False, 2025-05-07T20:32:29.5074888Z ) 2025-05-07T20:32:29.5075115Z self = 2025-05-07T20:32:29.5075303Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.5075307Z 2025-05-07T20:32:29.5075387Z @given( 2025-05-07T20:32:29.5075507Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5075609Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5075724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5075844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5075959Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5076033Z ) 2025-05-07T20:32:29.5076294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5076388Z def test_silu_mul_quant( 2025-05-07T20:32:29.5076463Z self, 2025-05-07T20:32:29.5076542Z T: int, 2025-05-07T20:32:29.5076617Z D: int, 2025-05-07T20:32:29.5076715Z scale_ub: Optional[float], 2025-05-07T20:32:29.5076810Z contiguous: bool, 2025-05-07T20:32:29.5076943Z compiled: bool, 2025-05-07T20:32:29.5077024Z ) -> None: 2025-05-07T20:32:29.5077122Z torch.manual_seed(2025) 2025-05-07T20:32:29.5077195Z 2025-05-07T20:32:29.5077367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5079207Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5079251Z 2025-05-07T20:32:29.5079409Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5079413Z 2025-05-07T20:32:29.5079555Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5079787Z self=, 2025-05-07T20:32:29.5079867Z T=4096, 2025-05-07T20:32:29.5079945Z D=5120, 2025-05-07T20:32:29.5080028Z scale_ub=None, 2025-05-07T20:32:29.5080196Z contiguous=True, 2025-05-07T20:32:29.5080282Z compiled=False, 2025-05-07T20:32:29.5080356Z ) 2025-05-07T20:32:29.5080582Z self = 2025-05-07T20:32:29.5080760Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:29.5080766Z 2025-05-07T20:32:29.5080846Z @given( 2025-05-07T20:32:29.5080964Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5081064Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5081183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5081306Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5081427Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5081503Z ) 2025-05-07T20:32:29.5081756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5081850Z def test_silu_mul_quant( 2025-05-07T20:32:29.5081931Z self, 2025-05-07T20:32:29.5082006Z T: int, 2025-05-07T20:32:29.5082084Z D: int, 2025-05-07T20:32:29.5082182Z scale_ub: Optional[float], 2025-05-07T20:32:29.5082273Z contiguous: bool, 2025-05-07T20:32:29.5082362Z compiled: bool, 2025-05-07T20:32:29.5082441Z ) -> None: 2025-05-07T20:32:29.5082536Z torch.manual_seed(2025) 2025-05-07T20:32:29.5082610Z 2025-05-07T20:32:29.5082782Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5084605Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5084617Z 2025-05-07T20:32:29.5084736Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5084741Z 2025-05-07T20:32:29.5084844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5085079Z self=, 2025-05-07T20:32:29.5085156Z T=2048, 2025-05-07T20:32:29.5085238Z D=5120, 2025-05-07T20:32:29.5085321Z scale_ub=None, 2025-05-07T20:32:29.5085408Z contiguous=False, 2025-05-07T20:32:29.5085498Z compiled=False, 2025-05-07T20:32:29.5085573Z ) 2025-05-07T20:32:29.5085844Z self = 2025-05-07T20:32:29.5086029Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5086034Z 2025-05-07T20:32:29.5086111Z @given( 2025-05-07T20:32:29.5086230Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5086332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5086447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5086568Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5086682Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5086796Z ) 2025-05-07T20:32:29.5087052Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5087146Z def test_silu_mul_quant( 2025-05-07T20:32:29.5087221Z self, 2025-05-07T20:32:29.5087303Z T: int, 2025-05-07T20:32:29.5087417Z D: int, 2025-05-07T20:32:29.5087517Z scale_ub: Optional[float], 2025-05-07T20:32:29.5087645Z contiguous: bool, 2025-05-07T20:32:29.5087732Z compiled: bool, 2025-05-07T20:32:29.5087809Z ) -> None: 2025-05-07T20:32:29.5087909Z torch.manual_seed(2025) 2025-05-07T20:32:29.5087981Z 2025-05-07T20:32:29.5088152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5089976Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5089986Z 2025-05-07T20:32:29.5090110Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5090116Z 2025-05-07T20:32:29.5090222Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5090451Z self=, 2025-05-07T20:32:29.5090530Z T=4096, 2025-05-07T20:32:29.5090607Z D=7168, 2025-05-07T20:32:29.5090689Z scale_ub=None, 2025-05-07T20:32:29.5090776Z contiguous=True, 2025-05-07T20:32:29.5090859Z compiled=True, 2025-05-07T20:32:29.5090931Z ) 2025-05-07T20:32:29.5091156Z self = 2025-05-07T20:32:29.5091332Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5091336Z 2025-05-07T20:32:29.5091416Z @given( 2025-05-07T20:32:29.5091535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5091635Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5091758Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5091882Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5091996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5092072Z ) 2025-05-07T20:32:29.5092324Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5092418Z def test_silu_mul_quant( 2025-05-07T20:32:29.5092496Z self, 2025-05-07T20:32:29.5092572Z T: int, 2025-05-07T20:32:29.5092650Z D: int, 2025-05-07T20:32:29.5092748Z scale_ub: Optional[float], 2025-05-07T20:32:29.5092843Z contiguous: bool, 2025-05-07T20:32:29.5092931Z compiled: bool, 2025-05-07T20:32:29.5093008Z ) -> None: 2025-05-07T20:32:29.5093103Z torch.manual_seed(2025) 2025-05-07T20:32:29.5093178Z 2025-05-07T20:32:29.5093348Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5095589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5095599Z 2025-05-07T20:32:29.5095721Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5095764Z 2025-05-07T20:32:29.5095870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5096105Z self=, 2025-05-07T20:32:29.5096182Z T=2048, 2025-05-07T20:32:29.5096262Z D=5120, 2025-05-07T20:32:29.5096344Z scale_ub=1200.0, 2025-05-07T20:32:29.5096469Z contiguous=False, 2025-05-07T20:32:29.5096559Z compiled=False, 2025-05-07T20:32:29.5096632Z ) 2025-05-07T20:32:29.5096892Z self = 2025-05-07T20:32:29.5097077Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5097081Z 2025-05-07T20:32:29.5097158Z @given( 2025-05-07T20:32:29.5097279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5097381Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5097495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5097614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5097731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5097805Z ) 2025-05-07T20:32:29.5098060Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5098154Z def test_silu_mul_quant( 2025-05-07T20:32:29.5098232Z self, 2025-05-07T20:32:29.5098313Z T: int, 2025-05-07T20:32:29.5098394Z D: int, 2025-05-07T20:32:29.5098495Z scale_ub: Optional[float], 2025-05-07T20:32:29.5098587Z contiguous: bool, 2025-05-07T20:32:29.5098673Z compiled: bool, 2025-05-07T20:32:29.5098750Z ) -> None: 2025-05-07T20:32:29.5098853Z torch.manual_seed(2025) 2025-05-07T20:32:29.5098925Z 2025-05-07T20:32:29.5099097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5100924Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

The same OutOfMemoryError is then raised at moe/activation_test.py:92 (the initial torch.randn([T, 2 * D], ...) allocation) for each of the following examples, with only the requested size changing with T and D:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=True) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False) -> Tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False) -> Tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> Tried to allocate 448.00 MiB

In every case GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free; the process has 22.04 GiB in use, 21.73 GiB of it allocated by PyTorch and 19.12 MiB reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
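A minimal sketch of applying that allocator hint, assuming the suite is re-launched in the same build_binary environment; PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it has to be set before the first CUDA allocation:

    # Sketch: apply the allocator hint from the OOM messages above.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after setting the env var so the allocator sees it

    # The kind of allocation that failed above, now served from expandable segments.
    x = torch.randn([2048, 2 * 5120], device="cuda", dtype=torch.bfloat16)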
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5127992Z 2025-05-07T20:32:29.5128111Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5128115Z 2025-05-07T20:32:29.5128224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5128499Z self=, 2025-05-07T20:32:29.5128580Z T=128, 2025-05-07T20:32:29.5128660Z D=5120, 2025-05-07T20:32:29.5128742Z scale_ub=1200.0, 2025-05-07T20:32:29.5128829Z contiguous=False, 2025-05-07T20:32:29.5128914Z compiled=False, 2025-05-07T20:32:29.5128987Z ) 2025-05-07T20:32:29.5129208Z self = 2025-05-07T20:32:29.5129389Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:29.5129393Z 2025-05-07T20:32:29.5129469Z @given( 2025-05-07T20:32:29.5129633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5129732Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5129847Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5129969Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5130084Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5130197Z ) 2025-05-07T20:32:29.5130561Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5130657Z def test_silu_mul_quant( 2025-05-07T20:32:29.5130733Z self, 2025-05-07T20:32:29.5130812Z T: int, 2025-05-07T20:32:29.5130887Z D: int, 2025-05-07T20:32:29.5130989Z scale_ub: Optional[float], 2025-05-07T20:32:29.5131079Z contiguous: bool, 2025-05-07T20:32:29.5131165Z compiled: bool, 2025-05-07T20:32:29.5131244Z ) -> None: 2025-05-07T20:32:29.5131339Z torch.manual_seed(2025) 2025-05-07T20:32:29.5131411Z 2025-05-07T20:32:29.5131586Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5131660Z 2025-05-07T20:32:29.5131752Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5131882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5131971Z x = x_sign * x_clamp 2025-05-07T20:32:29.5132055Z x0 = x[:, :D] 2025-05-07T20:32:29.5132137Z x1 = x[:, D:] 2025-05-07T20:32:29.5132212Z 2025-05-07T20:32:29.5132298Z if contiguous: 2025-05-07T20:32:29.5132395Z x0 = x0.contiguous() 2025-05-07T20:32:29.5132494Z x1 = x1.contiguous() 2025-05-07T20:32:29.5132567Z 2025-05-07T20:32:29.5132658Z if scale_ub is not None: 2025-05-07T20:32:29.5132767Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5132904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5132980Z ) 2025-05-07T20:32:29.5133059Z else: 2025-05-07T20:32:29.5133153Z scale_ub_tensor = None 2025-05-07T20:32:29.5133226Z 2025-05-07T20:32:29.5133360Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5133451Z op = silu_mul_quant 2025-05-07T20:32:29.5133535Z if compiled: 2025-05-07T20:32:29.5133642Z op = torch.compile(op) 2025-05-07T20:32:29.5133752Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5133828Z 2025-05-07T20:32:29.5133922Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5133926Z 2025-05-07T20:32:29.5134024Z moe/activation_test.py:117: 2025-05-07T20:32:29.5134157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5134258Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5134359Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5134881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5134983Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5135356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5135586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5135937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5136084Z kernel = self.compile( 2025-05-07T20:32:29.5136484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5136665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5136796Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5136801Z 2025-05-07T20:32:29.5137010Z self = 2025-05-07T20:32:29.5137814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5138372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7754ae0>} 2025-05-07T20:32:29.5139220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5139419Z context = 2025-05-07T20:32:29.5139423Z 2025-05-07T20:32:29.5139592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5139869Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5139980Z module_map=module_map) 2025-05-07T20:32:29.5140150Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5140250Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5140327Z E ^ 2025-05-07T20:32:29.5140694Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5140704Z 2025-05-07T20:32:29.5141134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5141138Z 2025-05-07T20:32:29.5141244Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5141477Z self=, 2025-05-07T20:32:29.5141554Z T=2048, 2025-05-07T20:32:29.5141634Z D=7168, 2025-05-07T20:32:29.5141717Z scale_ub=None, 2025-05-07T20:32:29.5141804Z contiguous=False, 2025-05-07T20:32:29.5141894Z compiled=False, 2025-05-07T20:32:29.5141970Z ) 2025-05-07T20:32:29.5142192Z self = 2025-05-07T20:32:29.5142374Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:29.5142379Z 2025-05-07T20:32:29.5142455Z @given( 2025-05-07T20:32:29.5142578Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5142684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5142802Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5142922Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5143038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5143113Z ) 2025-05-07T20:32:29.5143370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5143464Z def test_silu_mul_quant( 2025-05-07T20:32:29.5143542Z self, 2025-05-07T20:32:29.5143620Z T: int, 2025-05-07T20:32:29.5143699Z D: int, 2025-05-07T20:32:29.5143798Z scale_ub: Optional[float], 2025-05-07T20:32:29.5143895Z contiguous: bool, 2025-05-07T20:32:29.5143982Z compiled: bool, 2025-05-07T20:32:29.5144060Z ) -> None: 2025-05-07T20:32:29.5144159Z torch.manual_seed(2025) 2025-05-07T20:32:29.5144231Z 2025-05-07T20:32:29.5144455Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5146288Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
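The fp8e4nv CompilationError above, unlike the OOMs, points at a hardware limit: Triton's fp8e4nv corresponds to float8_e4m3fn, which its NVIDIA backend only emits for compute capability 8.9 and newer, while the A10G in this linux.g5.4xlarge runner reports (8, 6). A hedged sketch of a capability guard a test could use to deselect itself on such GPUs; the helper and test names are illustrative, not part of activation_test.py:

    import unittest
    import torch

    def _supports_fp8e4nv() -> bool:
        # fp8e4nv (float8_e4m3fn) codegen needs SM 8.9+ (Ada/Hopper);
        # the A10G on this runner is SM 8.6, hence the ValueError above.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class GuardedActivationTests(unittest.TestCase):
        @unittest.skipUnless(_supports_fp8e4nv(), "requires an fp8e4nv-capable GPU (SM 8.9+)")
        def test_silu_mul_quant_fp8(self) -> None:
            ...  # FP8 test body would go here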
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5146330Z 2025-05-07T20:32:29.5146454Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5146458Z 2025-05-07T20:32:29.5146562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5146790Z self=, 2025-05-07T20:32:29.5146907Z T=128, 2025-05-07T20:32:29.5146985Z D=7168, 2025-05-07T20:32:29.5147069Z scale_ub=1200.0, 2025-05-07T20:32:29.5147199Z contiguous=True, 2025-05-07T20:32:29.5147285Z compiled=True, 2025-05-07T20:32:29.5147358Z ) 2025-05-07T20:32:29.5147584Z self = 2025-05-07T20:32:29.5147755Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.5147759Z 2025-05-07T20:32:29.5147838Z @given( 2025-05-07T20:32:29.5147958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5148058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5148180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5148296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5148410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5148485Z ) 2025-05-07T20:32:29.5148737Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5148841Z def test_silu_mul_quant( 2025-05-07T20:32:29.5148919Z self, 2025-05-07T20:32:29.5148997Z T: int, 2025-05-07T20:32:29.5149078Z D: int, 2025-05-07T20:32:29.5149177Z scale_ub: Optional[float], 2025-05-07T20:32:29.5149266Z contiguous: bool, 2025-05-07T20:32:29.5149354Z compiled: bool, 2025-05-07T20:32:29.5149431Z ) -> None: 2025-05-07T20:32:29.5149527Z torch.manual_seed(2025) 2025-05-07T20:32:29.5149602Z 2025-05-07T20:32:29.5149773Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5149847Z 2025-05-07T20:32:29.5149944Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5150071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5150159Z x = x_sign * x_clamp 2025-05-07T20:32:29.5150243Z x0 = x[:, :D] 2025-05-07T20:32:29.5150322Z x1 = x[:, D:] 2025-05-07T20:32:29.5150396Z 2025-05-07T20:32:29.5150483Z if contiguous: 2025-05-07T20:32:29.5150577Z x0 = x0.contiguous() 2025-05-07T20:32:29.5150672Z x1 = x1.contiguous() 2025-05-07T20:32:29.5150746Z 2025-05-07T20:32:29.5150837Z if scale_ub is not None: 2025-05-07T20:32:29.5150948Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:29.5151085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:29.5151160Z ) 2025-05-07T20:32:29.5151239Z else: 2025-05-07T20:32:29.5151333Z scale_ub_tensor = None 2025-05-07T20:32:29.5151404Z 2025-05-07T20:32:29.5151539Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:29.5151633Z op = silu_mul_quant 2025-05-07T20:32:29.5151720Z if compiled: 2025-05-07T20:32:29.5151821Z op = torch.compile(op) 2025-05-07T20:32:29.5151928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5152003Z 2025-05-07T20:32:29.5152100Z > y_fp8, y_scale = fn() 2025-05-07T20:32:29.5152104Z 2025-05-07T20:32:29.5152251Z moe/activation_test.py:117: 2025-05-07T20:32:29.5152388Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5152489Z moe/activation_test.py:115: in fn 2025-05-07T20:32:29.5152591Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:29.5152973Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:29.5153066Z return fn(*args, **kwargs) 
2025-05-07T20:32:29.5153577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:29.5153718Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:29.5154088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:29.5154323Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:29.5154750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:29.5154846Z kernel = self.compile( 2025-05-07T20:32:29.5155244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:29.5155423Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:29.5155556Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:29.5155560Z 2025-05-07T20:32:29.5155770Z self = 2025-05-07T20:32:29.5156573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:29.5157100Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f30b7610040>} 2025-05-07T20:32:29.5157873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:29.5158071Z context = 2025-05-07T20:32:29.5158075Z 2025-05-07T20:32:29.5158244Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:29.5158522Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:29.5158634Z module_map=module_map) 2025-05-07T20:32:29.5158798Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:29.5158902Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:29.5158980Z E ^ 2025-05-07T20:32:29.5159351Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:29.5159356Z 2025-05-07T20:32:29.5159787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:29.5159792Z 2025-05-07T20:32:29.5159897Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5160224Z self=, 2025-05-07T20:32:29.5160303Z T=128, 2025-05-07T20:32:29.5160380Z D=7168, 2025-05-07T20:32:29.5160467Z scale_ub=1200.0, 2025-05-07T20:32:29.5160555Z contiguous=True, 2025-05-07T20:32:29.5160639Z compiled=False, 2025-05-07T20:32:29.5160713Z ) 2025-05-07T20:32:29.5160936Z self = 2025-05-07T20:32:29.5161110Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:29.5161120Z 2025-05-07T20:32:29.5161197Z @given( 2025-05-07T20:32:29.5161362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5161470Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5161586Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5161705Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5161823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5161899Z ) 2025-05-07T20:32:29.5162153Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5162251Z def test_silu_mul_quant( 2025-05-07T20:32:29.5162366Z self, 2025-05-07T20:32:29.5162442Z T: int, 2025-05-07T20:32:29.5162520Z D: int, 2025-05-07T20:32:29.5162618Z scale_ub: Optional[float], 2025-05-07T20:32:29.5162710Z contiguous: bool, 2025-05-07T20:32:29.5162796Z compiled: bool, 2025-05-07T20:32:29.5162875Z ) -> None: 2025-05-07T20:32:29.5163039Z torch.manual_seed(2025) 2025-05-07T20:32:29.5163113Z 2025-05-07T20:32:29.5163327Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5163406Z 2025-05-07T20:32:29.5163499Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5163625Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5165482Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5165490Z 2025-05-07T20:32:29.5165610Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:29.5165617Z 2025-05-07T20:32:29.5165730Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5165963Z self=, 2025-05-07T20:32:29.5166044Z T=128, 2025-05-07T20:32:29.5166121Z D=5120, 2025-05-07T20:32:29.5166204Z scale_ub=1200.0, 2025-05-07T20:32:29.5166293Z contiguous=True, 2025-05-07T20:32:29.5166375Z compiled=True, 2025-05-07T20:32:29.5166447Z ) 2025-05-07T20:32:29.5166675Z self = 2025-05-07T20:32:29.5166849Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:29.5166856Z 2025-05-07T20:32:29.5166932Z @given( 2025-05-07T20:32:29.5167054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5167153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5167273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5167393Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5167509Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5167589Z ) 2025-05-07T20:32:29.5167843Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5167935Z def test_silu_mul_quant( 2025-05-07T20:32:29.5168014Z self, 2025-05-07T20:32:29.5168091Z T: int, 2025-05-07T20:32:29.5168166Z D: int, 2025-05-07T20:32:29.5168267Z scale_ub: Optional[float], 2025-05-07T20:32:29.5168358Z contiguous: bool, 2025-05-07T20:32:29.5168443Z compiled: bool, 2025-05-07T20:32:29.5168526Z ) -> None: 2025-05-07T20:32:29.5168620Z torch.manual_seed(2025) 2025-05-07T20:32:29.5168695Z 2025-05-07T20:32:29.5168865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5168940Z 2025-05-07T20:32:29.5169036Z x_sign = torch.sign(x) 2025-05-07T20:32:29.5169165Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:29.5171038Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
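By this point the failure mode has shifted: earlier examples OOMed on the large initial torch.randn, but now even 20.00 MiB requests fail at the torch.clamp on moe/activation_test.py:95 with only 4.44 MiB free, which suggests memory accumulating across Hypothesis examples rather than any single oversized tensor. A small illustrative sketch (not part of the test file) for bracketing one example with allocator statistics:

    import torch

    def report_cuda_memory(tag: str) -> None:
        # memory_allocated = bytes held by live tensors; memory_reserved = bytes
        # the caching allocator is holding from the driver (live + cached).
        allocated = torch.cuda.memory_allocated() / 2**30
        reserved = torch.cuda.memory_reserved() / 2**30
        print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

    report_cuda_memory("before example")
    # ... run one test_silu_mul_quant example here ...
    torch.cuda.empty_cache()  # hand cached-but-unused blocks back to the driver
    report_cuda_memory("after empty_cache")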
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5171084Z 2025-05-07T20:32:29.5171206Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:29.5171210Z 2025-05-07T20:32:29.5171313Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:29.5171547Z self=, 2025-05-07T20:32:29.5171629Z T=128, 2025-05-07T20:32:29.5171745Z D=7168, 2025-05-07T20:32:29.5171830Z scale_ub=None, 2025-05-07T20:32:29.5171918Z contiguous=True, 2025-05-07T20:32:29.5172039Z compiled=True, 2025-05-07T20:32:29.5172114Z ) 2025-05-07T20:32:29.5172338Z self = 2025-05-07T20:32:29.5172511Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:29.5172516Z 2025-05-07T20:32:29.5172592Z @given( 2025-05-07T20:32:29.5172710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:29.5172811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:29.5172926Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:29.5173045Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:29.5173162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:29.5173235Z ) 2025-05-07T20:32:29.5173492Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:29.5173589Z def test_silu_mul_quant( 2025-05-07T20:32:29.5173664Z self, 2025-05-07T20:32:29.5173747Z T: int, 2025-05-07T20:32:29.5173825Z D: int, 2025-05-07T20:32:29.5173924Z scale_ub: Optional[float], 2025-05-07T20:32:29.5174017Z contiguous: bool, 2025-05-07T20:32:29.5174102Z compiled: bool, 2025-05-07T20:32:29.5174179Z ) -> None: 2025-05-07T20:32:29.5174278Z torch.manual_seed(2025) 2025-05-07T20:32:29.5174350Z 2025-05-07T20:32:29.5174522Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:29.5176352Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:29.5176363Z 2025-05-07T20:32:29.5176484Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:29.5176623Z =============================== warnings summary =============================== 2025-05-07T20:32:29.5176942Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5177256Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5177565Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:29.5178468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:29.5178757Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:29.5178762Z 2025-05-07T20:32:29.5178980Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:29.5179155Z ================= 1 failed, 1 deselected, 3 warnings in 13.80s ================= 2025-05-07T20:32:31.1966280Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:31.2584029Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:32:31.2584366Z 2025-05-07T20:32:33.2608856Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:35.4122145Z ============================= test session starts ============================== 2025-05-07T20:32:35.4123263Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:35.4123822Z cachedir: .pytest_cache 2025-05-07T20:32:35.4124604Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:35.4125557Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:35.4125985Z plugins: hypothesis-6.131.14 2025-05-07T20:32:36.9729874Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:37.0699227Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:37.0699811Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:37.0700125Z 2025-05-07T20:32:39.1693238Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1694273Z self=, 2025-05-07T20:32:39.1694705Z T=1, 2025-05-07T20:32:39.1694911Z D=5120, 2025-05-07T20:32:39.1695121Z scale_ub=None, 2025-05-07T20:32:39.1695343Z contiguous=True, 2025-05-07T20:32:39.1695586Z compiled=True, 2025-05-07T20:32:39.1695807Z ) 2025-05-07T20:32:39.1696144Z self = 2025-05-07T20:32:39.1696658Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:39.1696930Z 2025-05-07T20:32:39.1697021Z @given( 2025-05-07T20:32:39.1697279Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.1697604Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.1697935Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.1698290Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.1698637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.1698952Z ) 2025-05-07T20:32:39.1699333Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.1699792Z def test_silu_mul_quant( 2025-05-07T20:32:39.1700052Z self, 2025-05-07T20:32:39.1700264Z T: int, 2025-05-07T20:32:39.1700470Z D: int, 2025-05-07T20:32:39.1700708Z scale_ub: Optional[float], 2025-05-07T20:32:39.1700999Z contiguous: bool, 2025-05-07T20:32:39.1701249Z compiled: bool, 2025-05-07T20:32:39.1701490Z ) -> None: 2025-05-07T20:32:39.1701722Z torch.manual_seed(2025) 2025-05-07T20:32:39.1701973Z 2025-05-07T20:32:39.1702268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.1702630Z 2025-05-07T20:32:39.1702840Z x_sign = torch.sign(x) 2025-05-07T20:32:39.1703145Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:39.1703478Z x = x_sign * x_clamp 2025-05-07T20:32:39.1703735Z x0 = x[:, :D] 2025-05-07T20:32:39.1703960Z x1 = x[:, D:] 2025-05-07T20:32:39.1704508Z 2025-05-07T20:32:39.1704714Z if contiguous: 2025-05-07T20:32:39.1704955Z x0 = x0.contiguous() 2025-05-07T20:32:39.1705232Z x1 = x1.contiguous() 2025-05-07T20:32:39.1705488Z 2025-05-07T20:32:39.1705688Z if scale_ub is not None: 2025-05-07T20:32:39.1705982Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.1706338Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.1706660Z ) 2025-05-07T20:32:39.1706902Z else: 2025-05-07T20:32:39.1707141Z scale_ub_tensor = None 2025-05-07T20:32:39.1707491Z 2025-05-07T20:32:39.1707743Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1708081Z op = silu_mul_quant 2025-05-07T20:32:39.1708348Z if compiled: 2025-05-07T20:32:39.1708606Z op = torch.compile(op) 2025-05-07T20:32:39.1709031Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1709325Z 2025-05-07T20:32:39.1709610Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.1709917Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.1710228Z 2025-05-07T20:32:39.1710480Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1710834Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.1711146Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.1711477Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.1711859Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.1712194Z 2025-05-07T20:32:39.1712417Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.1712622Z 2025-05-07T20:32:39.1712729Z moe/activation_test.py:126: 2025-05-07T20:32:39.1713051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1713681Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.1714033Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.1714870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.1715661Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.1716245Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.1716974Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.1717730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.1718494Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.1719256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.1719933Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.1720647Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.1721189Z fn() 2025-05-07T20:32:39.1721714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.1722325Z self.fn.run( 2025-05-07T20:32:39.1722820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.1723368Z kernel = self.compile( 2025-05-07T20:32:39.1723941Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.1724626Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1725052Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1725294Z 2025-05-07T20:32:39.1725586Z self = 2025-05-07T20:32:39.1726712Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.1728157Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ac76700>} 2025-05-07T20:32:39.1729543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.1730669Z context = 2025-05-07T20:32:39.1730970Z 2025-05-07T20:32:39.1731144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.1731804Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1732301Z module_map=module_map) 2025-05-07T20:32:39.1732683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.1733070Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.1733358Z E ^ 2025-05-07T20:32:39.1733846Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1734315Z 2025-05-07T20:32:39.1734748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.1735291Z 2025-05-07T20:32:39.1735401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.1735841Z self=, 2025-05-07T20:32:39.1736265Z T=2048, 2025-05-07T20:32:39.1736469Z D=5120, 2025-05-07T20:32:39.1736676Z scale_ub=1200.0, 2025-05-07T20:32:39.1736919Z contiguous=True, 2025-05-07T20:32:39.1737155Z compiled=False, 2025-05-07T20:32:39.1737376Z ) 2025-05-07T20:32:39.1737718Z self = 2025-05-07T20:32:39.1738244Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:39.1738537Z 2025-05-07T20:32:39.1738619Z @given( 2025-05-07T20:32:39.1738871Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.1739199Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.1739535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.1739888Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.1740239Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.1740542Z ) 2025-05-07T20:32:39.1740915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.1741383Z def test_silu_mul_quant( 2025-05-07T20:32:39.1741640Z self, 2025-05-07T20:32:39.1741851Z T: int, 2025-05-07T20:32:39.1742103Z D: int, 2025-05-07T20:32:39.1742431Z scale_ub: Optional[float], 2025-05-07T20:32:39.1742827Z contiguous: bool, 2025-05-07T20:32:39.1743181Z compiled: bool, 2025-05-07T20:32:39.1743418Z ) -> None: 2025-05-07T20:32:39.1743648Z torch.manual_seed(2025) 2025-05-07T20:32:39.1743909Z 2025-05-07T20:32:39.1744381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.1744915Z 2025-05-07T20:32:39.1745174Z x_sign = torch.sign(x) 2025-05-07T20:32:39.1745592Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.1745914Z x = x_sign * x_clamp 2025-05-07T20:32:39.1746213Z x0 = x[:, :D] 
2025-05-07T20:32:39.1746443Z x1 = x[:, D:] 2025-05-07T20:32:39.1746674Z 2025-05-07T20:32:39.1746901Z if contiguous: 2025-05-07T20:32:39.1747145Z x0 = x0.contiguous() 2025-05-07T20:32:39.1747481Z x1 = x1.contiguous() 2025-05-07T20:32:39.1747738Z 2025-05-07T20:32:39.1747943Z if scale_ub is not None: 2025-05-07T20:32:39.1748225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.1748587Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.1748916Z ) 2025-05-07T20:32:39.1749124Z else: 2025-05-07T20:32:39.1749345Z scale_ub_tensor = None 2025-05-07T20:32:39.1749620Z 2025-05-07T20:32:39.1749868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.1750243Z op = silu_mul_quant 2025-05-07T20:32:39.1750515Z if compiled: 2025-05-07T20:32:39.1750780Z op = torch.compile(op) 2025-05-07T20:32:39.1751091Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1751385Z 2025-05-07T20:32:39.1751593Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.1751810Z 2025-05-07T20:32:39.1751923Z moe/activation_test.py:117: 2025-05-07T20:32:39.1752279Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1752632Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.1752929Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.1753641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.1754360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.1754922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.1755636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.1756324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.1756876Z kernel = self.compile( 2025-05-07T20:32:39.1757446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.1758126Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.1758548Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.1758793Z 2025-05-07T20:32:39.1759006Z self = 2025-05-07T20:32:39.1760238Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.1761671Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab2a020>} 2025-05-07T20:32:39.1763059Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.1764115Z context = 2025-05-07T20:32:39.1764418Z 2025-05-07T20:32:39.1764592Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.1765136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.1765618Z module_map=module_map) 2025-05-07T20:32:39.1765998Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.1766375Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.1766667Z E ^ 2025-05-07T20:32:39.1767175Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.1767648Z 2025-05-07T20:32:39.1768133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8306922Z 2025-05-07T20:32:39.8307381Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8308042Z self=, 2025-05-07T20:32:39.8308643Z T=2048, 2025-05-07T20:32:39.8308906Z D=5120, 2025-05-07T20:32:39.8309210Z scale_ub=1200.0, 2025-05-07T20:32:39.8309515Z contiguous=True, 2025-05-07T20:32:39.8309817Z compiled=True, 2025-05-07T20:32:39.8310074Z ) 2025-05-07T20:32:39.8310416Z self = 2025-05-07T20:32:39.8311239Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:39.8311526Z 2025-05-07T20:32:39.8311610Z @given( 2025-05-07T20:32:39.8311862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8312197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8312663Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8313027Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8313766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8314081Z ) 2025-05-07T20:32:39.8314450Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8314924Z def test_silu_mul_quant( 2025-05-07T20:32:39.8315189Z self, 2025-05-07T20:32:39.8315393Z T: int, 2025-05-07T20:32:39.8315606Z D: int, 2025-05-07T20:32:39.8315841Z scale_ub: Optional[float], 2025-05-07T20:32:39.8316126Z contiguous: bool, 2025-05-07T20:32:39.8316386Z compiled: bool, 2025-05-07T20:32:39.8316630Z ) -> None: 2025-05-07T20:32:39.8316855Z torch.manual_seed(2025) 2025-05-07T20:32:39.8317114Z 2025-05-07T20:32:39.8317408Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8317763Z 2025-05-07T20:32:39.8317976Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8318295Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8318627Z x = x_sign * x_clamp 2025-05-07T20:32:39.8318880Z x0 = x[:, :D] 2025-05-07T20:32:39.8319113Z x1 = x[:, D:] 2025-05-07T20:32:39.8319336Z 2025-05-07T20:32:39.8319532Z if contiguous: 2025-05-07T20:32:39.8319782Z x0 = x0.contiguous() 2025-05-07T20:32:39.8320059Z x1 = x1.contiguous() 2025-05-07T20:32:39.8320436Z 2025-05-07T20:32:39.8320644Z if scale_ub is not None: 2025-05-07T20:32:39.8320940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8321300Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8321637Z ) 2025-05-07T20:32:39.8321848Z else: 2025-05-07T20:32:39.8322069Z scale_ub_tensor = None 2025-05-07T20:32:39.8322344Z 2025-05-07T20:32:39.8322595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8322928Z op = silu_mul_quant 2025-05-07T20:32:39.8323201Z if compiled: 2025-05-07T20:32:39.8323471Z op = torch.compile(op) 2025-05-07T20:32:39.8323785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8324081Z 2025-05-07T20:32:39.8324289Z y_fp8, y_scale = fn() 2025-05-07T20:32:39.8324598Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:39.8324906Z 2025-05-07T20:32:39.8325162Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8325521Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:39.8325833Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:39.8326175Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:39.8326567Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.8326941Z 2025-05-07T20:32:39.8327165Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:39.8327373Z 2025-05-07T20:32:39.8327490Z moe/activation_test.py:126: 2025-05-07T20:32:39.8327903Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8328268Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:39.8328621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:39.8329448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:39.8330228Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:39.8330811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8331591Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8332319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:39.8333077Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:39.8334472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:39.8335156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:39.8335894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:39.8336525Z fn() 2025-05-07T20:32:39.8337144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:39.8337859Z self.fn.run( 2025-05-07T20:32:39.8338420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8339069Z kernel = self.compile( 2025-05-07T20:32:39.8339725Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8340527Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8341005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8341287Z 2025-05-07T20:32:39.8341527Z self = 2025-05-07T20:32:39.8342657Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8344104Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab34720>} 2025-05-07T20:32:39.8345500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8346574Z context = 2025-05-07T20:32:39.8346883Z 2025-05-07T20:32:39.8347065Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8347622Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8348118Z module_map=module_map) 2025-05-07T20:32:39.8348515Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8348900Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:39.8349184Z E ^ 2025-05-07T20:32:39.8349679Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.8350163Z 2025-05-07T20:32:39.8350601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.8351140Z 2025-05-07T20:32:39.8351261Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.8351756Z self=, 2025-05-07T20:32:39.8352197Z T=16384, 2025-05-07T20:32:39.8352413Z D=7168, 2025-05-07T20:32:39.8352619Z scale_ub=1200.0, 2025-05-07T20:32:39.8352866Z contiguous=False, 2025-05-07T20:32:39.8353116Z compiled=False, 2025-05-07T20:32:39.8353334Z ) 2025-05-07T20:32:39.8353678Z self = 2025-05-07T20:32:39.8354218Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:39.8354518Z 2025-05-07T20:32:39.8354654Z @given( 2025-05-07T20:32:39.8354899Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.8355239Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.8355576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.8355926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.8356324Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.8356640Z ) 2025-05-07T20:32:39.8357095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.8357569Z def test_silu_mul_quant( 2025-05-07T20:32:39.8357831Z self, 2025-05-07T20:32:39.8358043Z T: int, 2025-05-07T20:32:39.8358255Z D: int, 2025-05-07T20:32:39.8358494Z scale_ub: Optional[float], 2025-05-07T20:32:39.8358789Z contiguous: bool, 2025-05-07T20:32:39.8359045Z compiled: bool, 2025-05-07T20:32:39.8359285Z ) -> None: 2025-05-07T20:32:39.8359518Z torch.manual_seed(2025) 2025-05-07T20:32:39.8359777Z 2025-05-07T20:32:39.8360068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.8360512Z 2025-05-07T20:32:39.8360720Z x_sign = torch.sign(x) 2025-05-07T20:32:39.8361035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.8361375Z x = x_sign * x_clamp 2025-05-07T20:32:39.8361630Z x0 = x[:, :D] 2025-05-07T20:32:39.8361871Z x1 = x[:, D:] 2025-05-07T20:32:39.8362102Z 2025-05-07T20:32:39.8362301Z if contiguous: 2025-05-07T20:32:39.8362555Z x0 = x0.contiguous() 2025-05-07T20:32:39.8362841Z x1 = x1.contiguous() 2025-05-07T20:32:39.8363097Z 2025-05-07T20:32:39.8363307Z if scale_ub is not None: 2025-05-07T20:32:39.8363605Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.8363969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.8364297Z ) 2025-05-07T20:32:39.8364512Z else: 2025-05-07T20:32:39.8364745Z scale_ub_tensor = None 2025-05-07T20:32:39.8365012Z 2025-05-07T20:32:39.8365267Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.8365608Z op = silu_mul_quant 2025-05-07T20:32:39.8365875Z if compiled: 2025-05-07T20:32:39.8366147Z op = torch.compile(op) 2025-05-07T20:32:39.8366477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8366776Z 2025-05-07T20:32:39.8366992Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.8367169Z 2025-05-07T20:32:39.8367286Z moe/activation_test.py:117: 2025-05-07T20:32:39.8367601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8367962Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.8368270Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.8369004Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:39.8369728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.8370303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.8371029Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.8371787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.8372352Z kernel = self.compile( 2025-05-07T20:32:39.8372931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.8373628Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.8374048Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.8374293Z 2025-05-07T20:32:39.8374513Z self = 2025-05-07T20:32:39.8375695Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.8377134Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0899a13880>} 2025-05-07T20:32:39.8378623Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.8379692Z context = 2025-05-07T20:32:39.8380003Z 2025-05-07T20:32:39.8380182Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.8380741Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.8381243Z module_map=module_map) 2025-05-07T20:32:39.8381627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.8382006Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.8382286Z E ^ 2025-05-07T20:32:39.8382780Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self =
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    ... (test source identical to the previous example, through the definition of fn) ...

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': }
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
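In this compiled=True example the fn() call itself returned and the first kernel actually built was the reference path's, so the same architecture error surfaces inside triton_quantize_fp8_row instead. The row-wise quantization that kernel performs can be sketched in plain PyTorch roughly as below; these are assumed semantics inferred from how the test consumes the result (y_fp8 dequantized by a per-row y_scale), not FBGEMM's actual implementation:

```python
# A minimal eager sketch (assumed semantics) of row-wise fp8 quantization:
# one scale per row from the row's max magnitude, optionally capped by
# scale_ub, then a cast to torch.float8_e4m3fn. Names are illustrative.
from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # scale_ub is a [1] tensor
    scale = FP8_MAX / row_max.clamp(min=1e-12)  # guard all-zero rows
    y_fp8 = (y.to(torch.float32) * scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale.reciprocal()  # (quantized rows, per-row dequant scale)
```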
The remaining tried examples in this excerpt fail the same way: every compiled=False run dies compiling the fused kernel _fbgemm_silu_mul_quant, and every compiled=True run gets as far as the reference path before dying in _kernel_quantize_fp8_row. Condensed:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)
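For reference, the activation half of the fused kernel is exactly what ref_fn spells out eagerly: SiLU(x0) = x0 * sigmoid(x0), multiplied elementwise by x1 in float32, before row-wise quantization. A self-contained sketch; the function name is hypothetical:

```python
# Eager equivalent (sketch) of the activation math _fbgemm_silu_mul_quant
# fuses with quantization, mirroring ref_fn in the test above.
import torch


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
```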
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (via ref_fn -> triton_quantize_fp8_row, identical traceback)

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
  -> CompilationError in _fbgemm_silu_mul_quant (identical traceback)

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)
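The autotuner frames in the compiled=True tracebacks above also show that the error fires while lowering the kernel AST to TTIR (src.make_ir -> ast_to_ttir) for the very first tuning config, before any benchmarking runs. Any kernel that merely materializes the type reproduces it, as in this deliberately trivial, hypothetical kernel:

```python
# Hypothetical minimal reproducer: casting to tl.float8e4nv is rejected at
# compile time on pre-sm_89 GPUs, matching the ValueError in this log.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_fp8_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)  # fails here on sm_86


def try_cast(x: torch.Tensor) -> torch.Tensor:
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    grid = (triton.cdiv(x.numel(), 1024),)
    _cast_fp8_kernel[grid](x, y, x.numel(), BLOCK=1024)
    return y
```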
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> CompilationError in _kernel_quantize_fp8_row (identical traceback)

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> (log excerpt truncated mid-example)
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:42.9781864Z 2025-05-07T20:32:42.9781968Z moe/activation_test.py:126: 2025-05-07T20:32:42.9782282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9782630Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:42.9782960Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:42.9783769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:42.9784552Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:42.9785112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.9785816Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.9786520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:42.9787263Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:42.9788017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:42.9788724Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:42.9789353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:42.9789899Z fn() 2025-05-07T20:32:42.9790477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:42.9791091Z self.fn.run( 2025-05-07T20:32:42.9791578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.9792122Z kernel = self.compile( 2025-05-07T20:32:42.9792684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.9793363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.9793819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.9794055Z 2025-05-07T20:32:42.9794267Z self = 2025-05-07T20:32:42.9795424Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.9796887Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb34900>} 2025-05-07T20:32:42.9798566Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.9799771Z context = 2025-05-07T20:32:42.9800162Z 2025-05-07T20:32:42.9800338Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.9800886Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.9801375Z module_map=module_map) 2025-05-07T20:32:42.9801756Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.9802134Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:42.9802415Z E ^ 2025-05-07T20:32:42.9802892Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.9803365Z 2025-05-07T20:32:42.9803799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.9993632Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:42.9995118Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:42.9996612Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:42.9997642Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:42.9998860Z W0507 20:32:42.998000 228507 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:43.3976275Z 2025-05-07T20:32:43.3976651Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.3977301Z self=, 2025-05-07T20:32:43.3977938Z T=1, 2025-05-07T20:32:43.3978206Z D=5120, 2025-05-07T20:32:43.3978494Z scale_ub=1200.0, 2025-05-07T20:32:43.3978757Z contiguous=True, 2025-05-07T20:32:43.3978996Z compiled=True, 2025-05-07T20:32:43.3979219Z ) 2025-05-07T20:32:43.3979678Z self = 2025-05-07T20:32:43.3980205Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.3980482Z 2025-05-07T20:32:43.3980573Z @given( 2025-05-07T20:32:43.3980818Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.3981156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.3981487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.3981847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.3982194Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.3982565Z ) 2025-05-07T20:32:43.3982941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.3983406Z def test_silu_mul_quant( 2025-05-07T20:32:43.3983667Z self, 2025-05-07T20:32:43.3983879Z T: int, 2025-05-07T20:32:43.3984087Z D: int, 2025-05-07T20:32:43.3984388Z scale_ub: Optional[float], 2025-05-07T20:32:43.3984678Z contiguous: bool, 2025-05-07T20:32:43.3984987Z compiled: bool, 2025-05-07T20:32:43.3985229Z ) -> None: 2025-05-07T20:32:43.3985458Z torch.manual_seed(2025) 2025-05-07T20:32:43.3985709Z 2025-05-07T20:32:43.3986003Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.3986369Z 2025-05-07T20:32:43.3986569Z x_sign = torch.sign(x) 2025-05-07T20:32:43.3986883Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.3987215Z x = x_sign * x_clamp 2025-05-07T20:32:43.3987472Z x0 = x[:, :D] 2025-05-07T20:32:43.3987699Z x1 = x[:, D:] 2025-05-07T20:32:43.3987921Z 2025-05-07T20:32:43.3988121Z if contiguous: 2025-05-07T20:32:43.3988366Z x0 = x0.contiguous() 2025-05-07T20:32:43.3988644Z x1 = x1.contiguous() 2025-05-07T20:32:43.3988898Z 2025-05-07T20:32:43.3989096Z if scale_ub is not None: 2025-05-07T20:32:43.3989395Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.3989756Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:43.3990079Z ) 2025-05-07T20:32:43.3990286Z else: 2025-05-07T20:32:43.3990511Z scale_ub_tensor = None 2025-05-07T20:32:43.3990773Z 2025-05-07T20:32:43.3991023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.3991360Z op = silu_mul_quant 2025-05-07T20:32:43.3991621Z if compiled: 2025-05-07T20:32:43.3991886Z op = torch.compile(op) 2025-05-07T20:32:43.3992205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3992505Z 2025-05-07T20:32:43.3992709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.3992891Z 2025-05-07T20:32:43.3993001Z moe/activation_test.py:117: 2025-05-07T20:32:43.3993318Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.3993675Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.3993982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.3994580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.3995171Z return fn(*args, **kwargs) 2025-05-07T20:32:43.3995876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.3996600Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.3997173Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.3997927Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.3998649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.3999213Z kernel = self.compile( 2025-05-07T20:32:43.3999887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4000678Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4001107Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4001352Z 2025-05-07T20:32:43.4001576Z self = 2025-05-07T20:32:43.4002704Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4004196Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73cd60>} 2025-05-07T20:32:43.4005647Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4006761Z context = 2025-05-07T20:32:43.4007065Z 2025-05-07T20:32:43.4007251Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4007801Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4008301Z module_map=module_map) 2025-05-07T20:32:43.4008689Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4009063Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.4009340Z E ^ 2025-05-07T20:32:43.4009830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4010303Z 2025-05-07T20:32:43.4010747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.4011293Z 2025-05-07T20:32:43.4011408Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.4011854Z self=, 2025-05-07T20:32:43.4012283Z T=1, 2025-05-07T20:32:43.4012475Z D=5120, 2025-05-07T20:32:43.4012682Z scale_ub=None, 2025-05-07T20:32:43.4012911Z contiguous=False, 2025-05-07T20:32:43.4013157Z compiled=True, 2025-05-07T20:32:43.4013557Z ) 2025-05-07T20:32:43.4013901Z self = 2025-05-07T20:32:43.4014425Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.4014702Z 2025-05-07T20:32:43.4014788Z @given( 2025-05-07T20:32:43.4015039Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.4015377Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.4015698Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.4022346Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.4022750Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.4023051Z ) 2025-05-07T20:32:43.4023429Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.4023908Z def test_silu_mul_quant( 2025-05-07T20:32:43.4024171Z self, 2025-05-07T20:32:43.4024373Z T: int, 2025-05-07T20:32:43.4024585Z D: int, 2025-05-07T20:32:43.4024818Z scale_ub: Optional[float], 2025-05-07T20:32:43.4025105Z contiguous: bool, 2025-05-07T20:32:43.4025368Z compiled: bool, 2025-05-07T20:32:43.4025610Z ) -> None: 2025-05-07T20:32:43.4025834Z torch.manual_seed(2025) 2025-05-07T20:32:43.4026093Z 2025-05-07T20:32:43.4026382Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.4026739Z 2025-05-07T20:32:43.4026947Z x_sign = torch.sign(x) 2025-05-07T20:32:43.4027368Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.4027715Z x = x_sign * x_clamp 2025-05-07T20:32:43.4028007Z x0 = x[:, :D] 2025-05-07T20:32:43.4028242Z x1 = x[:, D:] 2025-05-07T20:32:43.4028456Z 2025-05-07T20:32:43.4028653Z if contiguous: 2025-05-07T20:32:43.4028898Z x0 = x0.contiguous() 2025-05-07T20:32:43.4029164Z x1 = x1.contiguous() 2025-05-07T20:32:43.4029420Z 2025-05-07T20:32:43.4029624Z if scale_ub is not None: 2025-05-07T20:32:43.4029913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.4030265Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.4030668Z ) 2025-05-07T20:32:43.4030877Z else: 2025-05-07T20:32:43.4031097Z scale_ub_tensor = None 2025-05-07T20:32:43.4031364Z 2025-05-07T20:32:43.4031616Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4032009Z op = silu_mul_quant 2025-05-07T20:32:43.4032275Z if compiled: 2025-05-07T20:32:43.4032597Z op = torch.compile(op) 2025-05-07T20:32:43.4032908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.4033202Z 2025-05-07T20:32:43.4033408Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.4033705Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.4034011Z 2025-05-07T20:32:43.4034264Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.4034620Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.4034926Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.4035262Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.4035641Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.4035965Z 2025-05-07T20:32:43.4036184Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.4036390Z 2025-05-07T20:32:43.4036512Z moe/activation_test.py:126: 2025-05-07T20:32:43.4036824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4037189Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.4037541Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.4038390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.4039216Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.4039796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.4040616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.4041342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.4042109Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.4042888Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.4043565Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.4044202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.4044747Z fn() 2025-05-07T20:32:43.4045285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.4045900Z self.fn.run( 2025-05-07T20:32:43.4046395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.4046959Z kernel = self.compile( 2025-05-07T20:32:43.4047532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.4048220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.4048753Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.4049003Z 2025-05-07T20:32:43.4049223Z self = 2025-05-07T20:32:43.4050356Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.4051791Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad732de0>} 2025-05-07T20:32:43.4053242Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.4054362Z context = 2025-05-07T20:32:43.4054673Z 2025-05-07T20:32:43.4054920Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.4055477Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.4055963Z module_map=module_map) 2025-05-07T20:32:43.4056351Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.4056731Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.4057006Z E ^ 2025-05-07T20:32:43.4057495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.4057966Z 2025-05-07T20:32:43.4058409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5462829Z 2025-05-07T20:32:43.5463980Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5465358Z self=, 2025-05-07T20:32:43.5466481Z T=1, 2025-05-07T20:32:43.5466874Z D=5120, 2025-05-07T20:32:43.5467288Z scale_ub=None, 2025-05-07T20:32:43.5467742Z contiguous=True, 2025-05-07T20:32:43.5468021Z compiled=False, 2025-05-07T20:32:43.5468277Z ) 2025-05-07T20:32:43.5468625Z self = 2025-05-07T20:32:43.5469141Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.5469425Z 2025-05-07T20:32:43.5469513Z @given( 2025-05-07T20:32:43.5469778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5470113Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5470449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5470808Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5471168Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5471476Z ) 2025-05-07T20:32:43.5471868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5472344Z def test_silu_mul_quant( 2025-05-07T20:32:43.5472603Z self, 2025-05-07T20:32:43.5472826Z T: int, 2025-05-07T20:32:43.5473045Z D: int, 2025-05-07T20:32:43.5473279Z scale_ub: Optional[float], 2025-05-07T20:32:43.5473582Z contiguous: bool, 2025-05-07T20:32:43.5473846Z compiled: bool, 2025-05-07T20:32:43.5474089Z ) -> None: 2025-05-07T20:32:43.5474330Z torch.manual_seed(2025) 2025-05-07T20:32:43.5474598Z 2025-05-07T20:32:43.5474891Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5475264Z 2025-05-07T20:32:43.5475480Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5475790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5476131Z x = x_sign * x_clamp 2025-05-07T20:32:43.5476398Z x0 = x[:, :D] 2025-05-07T20:32:43.5476638Z x1 = x[:, D:] 2025-05-07T20:32:43.5477171Z 2025-05-07T20:32:43.5477382Z if contiguous: 2025-05-07T20:32:43.5477641Z x0 = x0.contiguous() 2025-05-07T20:32:43.5477915Z x1 = x1.contiguous() 2025-05-07T20:32:43.5478184Z 2025-05-07T20:32:43.5478398Z if scale_ub is not None: 2025-05-07T20:32:43.5478691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5479059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5479400Z ) 2025-05-07T20:32:43.5479606Z else: 2025-05-07T20:32:43.5479966Z scale_ub_tensor = None 2025-05-07T20:32:43.5480352Z 2025-05-07T20:32:43.5480600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5480948Z op = silu_mul_quant 2025-05-07T20:32:43.5481222Z if compiled: 2025-05-07T20:32:43.5481484Z op = torch.compile(op) 2025-05-07T20:32:43.5481900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5482204Z 2025-05-07T20:32:43.5482487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5482678Z 2025-05-07T20:32:43.5482789Z moe/activation_test.py:117: 2025-05-07T20:32:43.5483112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5483473Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5483772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5484508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5485247Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5485814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5486545Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5487246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5487826Z kernel = self.compile( 2025-05-07T20:32:43.5488400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5489099Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5489530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5489809Z 2025-05-07T20:32:43.5490029Z self = 2025-05-07T20:32:43.5491164Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5492623Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898019b20>} 2025-05-07T20:32:43.5494020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5495098Z context = 2025-05-07T20:32:43.5495415Z 2025-05-07T20:32:43.5495595Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5496155Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5496648Z module_map=module_map) 2025-05-07T20:32:43.5497046Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5497429Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5497712Z E ^ 2025-05-07T20:32:43.5498202Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5498735Z 2025-05-07T20:32:43.5499176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5499712Z 2025-05-07T20:32:43.5499832Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5500275Z self=, 2025-05-07T20:32:43.5500699Z T=128, 2025-05-07T20:32:43.5500906Z D=5120, 2025-05-07T20:32:43.5501118Z scale_ub=None, 2025-05-07T20:32:43.5501348Z contiguous=False, 2025-05-07T20:32:43.5501596Z compiled=True, 2025-05-07T20:32:43.5501862Z ) 2025-05-07T20:32:43.5502193Z self = 2025-05-07T20:32:43.5502711Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.5502991Z 2025-05-07T20:32:43.5503080Z @given( 2025-05-07T20:32:43.5503359Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.5503696Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.5504063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.5504418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.5504762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.5505068Z ) 2025-05-07T20:32:43.5505442Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.5505901Z def test_silu_mul_quant( 2025-05-07T20:32:43.5506159Z self, 2025-05-07T20:32:43.5506370Z T: int, 2025-05-07T20:32:43.5506579Z D: int, 2025-05-07T20:32:43.5506816Z scale_ub: Optional[float], 2025-05-07T20:32:43.5507110Z contiguous: bool, 2025-05-07T20:32:43.5507384Z compiled: bool, 2025-05-07T20:32:43.5507616Z ) -> None: 2025-05-07T20:32:43.5507849Z torch.manual_seed(2025) 2025-05-07T20:32:43.5508110Z 2025-05-07T20:32:43.5508402Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.5508769Z 2025-05-07T20:32:43.5508986Z x_sign = torch.sign(x) 2025-05-07T20:32:43.5509299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.5509624Z x = x_sign * x_clamp 2025-05-07T20:32:43.5509884Z x0 = x[:, :D] 2025-05-07T20:32:43.5510118Z x1 = x[:, D:] 2025-05-07T20:32:43.5510334Z 2025-05-07T20:32:43.5510537Z if contiguous: 2025-05-07T20:32:43.5510787Z x0 = x0.contiguous() 2025-05-07T20:32:43.5511058Z x1 = x1.contiguous() 2025-05-07T20:32:43.5511315Z 2025-05-07T20:32:43.5511525Z if scale_ub is not None: 2025-05-07T20:32:43.5511809Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.5512170Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.5512498Z ) 2025-05-07T20:32:43.5512697Z else: 2025-05-07T20:32:43.5512928Z scale_ub_tensor = None 2025-05-07T20:32:43.5513195Z 2025-05-07T20:32:43.5513753Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.5514092Z op = silu_mul_quant 2025-05-07T20:32:43.5514362Z if compiled: 2025-05-07T20:32:43.5514625Z op = torch.compile(op) 2025-05-07T20:32:43.5514935Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5515229Z 2025-05-07T20:32:43.5515439Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.5515613Z 2025-05-07T20:32:43.5515719Z moe/activation_test.py:117: 2025-05-07T20:32:43.5516035Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5516396Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.5516693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.5517286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.5517876Z return fn(*args, **kwargs) 
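Aside on the repeated CompilationError above: Triton's fp8e4nv is the e4m3 float8 format, and converting to it requires compute capability 8.9 or newer (Ada/Hopper). The g5 runner's A10G GPU is sm_86, where Triton exposes only fp8e4b15 and fp8e5, hence every fp8 kernel compile in this job fails the same way. A minimal guard one could use to skip fp8 paths on such GPUs, sketched here with a hypothetical helper name that is not from this codebase:

import torch

def supports_fp8e4nv() -> bool:
    # get_device_capability() returns e.g. (8, 6) for A10G or (9, 0) for H100;
    # Triton's fp8e4nv (e4m3) conversions need at least (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
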
2025-05-07T20:32:43.5518662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.5519381Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.5519951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.5520801Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.5521491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.5522054Z kernel = self.compile( 2025-05-07T20:32:43.5522693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.5523387Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.5523804Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.5524118Z 2025-05-07T20:32:43.5524340Z self = 2025-05-07T20:32:43.5525524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.5526965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad733a60>} 2025-05-07T20:32:43.5528358Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.5529427Z context = 2025-05-07T20:32:43.5529743Z 2025-05-07T20:32:43.5529924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.5530482Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.5530970Z module_map=module_map) 2025-05-07T20:32:43.5531362Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.5531741Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.5532025Z E ^ 2025-05-07T20:32:43.5532509Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.5532989Z 2025-05-07T20:32:43.5533431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.5533966Z 2025-05-07T20:32:43.5534087Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.5534528Z self=, 2025-05-07T20:32:43.5534952Z T=128, 2025-05-07T20:32:43.5535158Z D=7168, 2025-05-07T20:32:43.5535375Z scale_ub=1200.0, 2025-05-07T20:32:43.5535617Z contiguous=False, 2025-05-07T20:32:43.5535863Z compiled=False, 2025-05-07T20:32:43.7097410Z ) 2025-05-07T20:32:43.7098412Z self = 2025-05-07T20:32:43.7100051Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.7100878Z 2025-05-07T20:32:43.7101059Z @given( 2025-05-07T20:32:43.7101540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7102177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7102839Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7103524Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7104193Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7104789Z ) 2025-05-07T20:32:43.7105520Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7106714Z def test_silu_mul_quant( 2025-05-07T20:32:43.7107212Z self, 2025-05-07T20:32:43.7107638Z T: int, 2025-05-07T20:32:43.7108094Z D: int, 2025-05-07T20:32:43.7108421Z scale_ub: Optional[float], 2025-05-07T20:32:43.7108786Z contiguous: bool, 2025-05-07T20:32:43.7109113Z compiled: bool, 2025-05-07T20:32:43.7109409Z ) -> None: 2025-05-07T20:32:43.7109663Z torch.manual_seed(2025) 2025-05-07T20:32:43.7109923Z 2025-05-07T20:32:43.7110208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7110575Z 2025-05-07T20:32:43.7110878Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7111180Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7111515Z x = x_sign * x_clamp 2025-05-07T20:32:43.7111772Z x0 = x[:, :D] 2025-05-07T20:32:43.7111998Z x1 = x[:, D:] 2025-05-07T20:32:43.7112224Z 2025-05-07T20:32:43.7112504Z if contiguous: 2025-05-07T20:32:43.7112748Z x0 = x0.contiguous() 2025-05-07T20:32:43.7113143Z x1 = x1.contiguous() 2025-05-07T20:32:43.7113707Z 2025-05-07T20:32:43.7113917Z if scale_ub is not None: 2025-05-07T20:32:43.7114200Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7114557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7114891Z ) 2025-05-07T20:32:43.7115094Z else: 2025-05-07T20:32:43.7115321Z scale_ub_tensor = None 2025-05-07T20:32:43.7115589Z 2025-05-07T20:32:43.7115831Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7116173Z op = silu_mul_quant 2025-05-07T20:32:43.7116441Z if compiled: 2025-05-07T20:32:43.7116701Z op = torch.compile(op) 2025-05-07T20:32:43.7117018Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7117313Z 2025-05-07T20:32:43.7117519Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7117700Z 2025-05-07T20:32:43.7117811Z moe/activation_test.py:117: 2025-05-07T20:32:43.7118128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7118483Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7118778Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7119510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7120337Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7120899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7121621Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7122321Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7122884Z kernel = self.compile( 2025-05-07T20:32:43.7123459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7124152Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7124575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7124862Z 2025-05-07T20:32:43.7125080Z self = 2025-05-07T20:32:43.7126214Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7127678Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb3a2a0>} 2025-05-07T20:32:43.7129183Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7130266Z context = 2025-05-07T20:32:43.7130579Z 2025-05-07T20:32:43.7130755Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7131308Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7131795Z module_map=module_map) 2025-05-07T20:32:43.7132183Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7132625Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7132907Z E ^ 2025-05-07T20:32:43.7133389Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7133867Z 2025-05-07T20:32:43.7134366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7134958Z 2025-05-07T20:32:43.7135079Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7135512Z self=, 2025-05-07T20:32:43.7135940Z T=128, 2025-05-07T20:32:43.7136146Z D=5120, 2025-05-07T20:32:43.7136358Z scale_ub=None, 2025-05-07T20:32:43.7136581Z contiguous=False, 2025-05-07T20:32:43.7136826Z compiled=False, 2025-05-07T20:32:43.7137047Z ) 2025-05-07T20:32:43.7137379Z self = 2025-05-07T20:32:43.7137908Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.7138192Z 2025-05-07T20:32:43.7138285Z @given( 2025-05-07T20:32:43.7138526Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7138862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7139195Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7139544Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7139904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7140214Z ) 2025-05-07T20:32:43.7140592Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7141056Z def test_silu_mul_quant( 2025-05-07T20:32:43.7141319Z self, 2025-05-07T20:32:43.7141533Z T: int, 2025-05-07T20:32:43.7141744Z D: int, 2025-05-07T20:32:43.7141981Z scale_ub: Optional[float], 2025-05-07T20:32:43.7142273Z contiguous: bool, 2025-05-07T20:32:43.7142529Z compiled: bool, 2025-05-07T20:32:43.7142773Z ) -> None: 2025-05-07T20:32:43.7143009Z torch.manual_seed(2025) 2025-05-07T20:32:43.7143265Z 2025-05-07T20:32:43.7143560Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7143931Z 2025-05-07T20:32:43.7144135Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7144452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7144791Z x = x_sign * x_clamp 2025-05-07T20:32:43.7145050Z x0 = x[:, :D] 2025-05-07T20:32:43.7145278Z x1 = x[:, D:] 2025-05-07T20:32:43.7145504Z 2025-05-07T20:32:43.7145709Z if contiguous: 2025-05-07T20:32:43.7145950Z x0 = x0.contiguous() 2025-05-07T20:32:43.7146231Z x1 = x1.contiguous() 2025-05-07T20:32:43.7146492Z 2025-05-07T20:32:43.7146710Z if scale_ub is not None: 2025-05-07T20:32:43.7153913Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7154299Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7154632Z ) 2025-05-07T20:32:43.7154850Z else: 2025-05-07T20:32:43.7155086Z scale_ub_tensor = None 2025-05-07T20:32:43.7155351Z 2025-05-07T20:32:43.7155610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7155964Z op = silu_mul_quant 2025-05-07T20:32:43.7156327Z if compiled: 2025-05-07T20:32:43.7156594Z op = torch.compile(op) 2025-05-07T20:32:43.7156919Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7157217Z 2025-05-07T20:32:43.7157422Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7157609Z 2025-05-07T20:32:43.7157719Z moe/activation_test.py:117: 2025-05-07T20:32:43.7158043Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7158398Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7158710Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7159494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7160308Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7160881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7161702Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7162418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7162979Z kernel = self.compile( 2025-05-07T20:32:43.7163559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7164268Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7164702Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7164951Z 2025-05-07T20:32:43.7165173Z self = 2025-05-07T20:32:43.7166316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7167767Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73c720>} 2025-05-07T20:32:43.7169170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7170246Z context = 2025-05-07T20:32:43.7170553Z 2025-05-07T20:32:43.7170735Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7171297Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7171799Z module_map=module_map) 2025-05-07T20:32:43.7172186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7172576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7172860Z E ^ 2025-05-07T20:32:43.7173349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7173832Z 2025-05-07T20:32:43.7174270Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.7174817Z 2025-05-07T20:32:43.7174929Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.7175375Z self=, 2025-05-07T20:32:43.7175801Z T=128, 2025-05-07T20:32:43.7176007Z D=5120, 2025-05-07T20:32:43.7176223Z scale_ub=1200.0, 2025-05-07T20:32:43.7176462Z contiguous=True, 2025-05-07T20:32:43.7176705Z compiled=False, 2025-05-07T20:32:43.7176935Z ) 2025-05-07T20:32:43.7177282Z self = 2025-05-07T20:32:43.7177855Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.7178157Z 2025-05-07T20:32:43.7178243Z @given( 2025-05-07T20:32:43.7178497Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.7178827Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.7179164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.7179517Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.7179867Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.7180177Z ) 2025-05-07T20:32:43.7180554Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.7181070Z def test_silu_mul_quant( 2025-05-07T20:32:43.7181324Z self, 2025-05-07T20:32:43.7181539Z T: int, 2025-05-07T20:32:43.7181757Z D: int, 2025-05-07T20:32:43.7181989Z scale_ub: Optional[float], 2025-05-07T20:32:43.7182329Z contiguous: bool, 2025-05-07T20:32:43.7182594Z compiled: bool, 2025-05-07T20:32:43.7182837Z ) -> None: 2025-05-07T20:32:43.7183114Z torch.manual_seed(2025) 2025-05-07T20:32:43.7183384Z 2025-05-07T20:32:43.7183672Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.7184043Z 2025-05-07T20:32:43.7184261Z x_sign = torch.sign(x) 2025-05-07T20:32:43.7184569Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.7184909Z x = x_sign * x_clamp 2025-05-07T20:32:43.7185171Z x0 = x[:, :D] 2025-05-07T20:32:43.7185398Z x1 = x[:, D:] 2025-05-07T20:32:43.7185630Z 2025-05-07T20:32:43.7185839Z if contiguous: 2025-05-07T20:32:43.7186083Z x0 = x0.contiguous() 2025-05-07T20:32:43.7186368Z x1 = x1.contiguous() 2025-05-07T20:32:43.7186630Z 2025-05-07T20:32:43.7186833Z if scale_ub is not None: 2025-05-07T20:32:43.7187131Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.7187499Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.7187828Z ) 2025-05-07T20:32:43.7188030Z else: 2025-05-07T20:32:43.7188254Z scale_ub_tensor = None 2025-05-07T20:32:43.7188520Z 2025-05-07T20:32:43.7188763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.7189096Z op = silu_mul_quant 2025-05-07T20:32:43.7189361Z if compiled: 2025-05-07T20:32:43.7189619Z op = torch.compile(op) 2025-05-07T20:32:43.7189932Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7190221Z 2025-05-07T20:32:43.7190420Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.7190601Z 2025-05-07T20:32:43.7190706Z moe/activation_test.py:117: 2025-05-07T20:32:43.7191015Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7191367Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.7191657Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.7192385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.7193104Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.7193659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.7194372Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.7195064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.7195625Z kernel = self.compile( 2025-05-07T20:32:43.7196181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.7196866Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.7197284Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.7197523Z 2025-05-07T20:32:43.7197802Z self = 2025-05-07T20:32:43.7198914Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.7200425Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f8c20>} 2025-05-07T20:32:43.7201820Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.7202931Z context = 2025-05-07T20:32:43.7203232Z 2025-05-07T20:32:43.7203444Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.7204585Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.7205083Z module_map=module_map) 2025-05-07T20:32:43.7205467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.7205833Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.7206111Z E ^ 2025-05-07T20:32:43.7206599Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.7207063Z 2025-05-07T20:32:43.7207497Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.8731580Z 2025-05-07T20:32:43.8731958Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.8732627Z self=, 2025-05-07T20:32:43.8733269Z T=1, 2025-05-07T20:32:43.8733531Z D=7168, 2025-05-07T20:32:43.8733807Z scale_ub=1200.0, 2025-05-07T20:32:43.8734058Z contiguous=True, 2025-05-07T20:32:43.8734292Z compiled=True, 2025-05-07T20:32:43.8734517Z ) 2025-05-07T20:32:43.8734861Z self = 2025-05-07T20:32:43.8735380Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.8735657Z 2025-05-07T20:32:43.8735739Z @given( 2025-05-07T20:32:43.8735986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.8736320Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.8736649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.8737004Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.8737357Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.8737655Z ) 2025-05-07T20:32:43.8738031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.8738512Z def test_silu_mul_quant( 2025-05-07T20:32:43.8738774Z self, 2025-05-07T20:32:43.8738976Z T: int, 2025-05-07T20:32:43.8739189Z D: int, 2025-05-07T20:32:43.8739425Z scale_ub: Optional[float], 2025-05-07T20:32:43.8739710Z contiguous: bool, 2025-05-07T20:32:43.8739972Z compiled: bool, 2025-05-07T20:32:43.8740214Z ) -> None: 2025-05-07T20:32:43.8740436Z torch.manual_seed(2025) 2025-05-07T20:32:43.8740694Z 2025-05-07T20:32:43.8740982Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.8741342Z 2025-05-07T20:32:43.8741559Z x_sign = torch.sign(x) 2025-05-07T20:32:43.8741877Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.8742202Z x = x_sign * x_clamp 2025-05-07T20:32:43.8742459Z x0 = x[:, :D] 2025-05-07T20:32:43.8742691Z x1 = x[:, D:] 2025-05-07T20:32:43.8742909Z 2025-05-07T20:32:43.8743107Z if contiguous: 2025-05-07T20:32:43.8743640Z x0 = x0.contiguous() 2025-05-07T20:32:43.8743920Z x1 = x1.contiguous() 2025-05-07T20:32:43.8744183Z 2025-05-07T20:32:43.8744389Z if scale_ub is not None: 2025-05-07T20:32:43.8744676Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.8745038Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.8745371Z ) 2025-05-07T20:32:43.8745578Z else: 2025-05-07T20:32:43.8745798Z scale_ub_tensor = None 2025-05-07T20:32:43.8746066Z 2025-05-07T20:32:43.8746312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.8746734Z op = silu_mul_quant 2025-05-07T20:32:43.8747005Z if compiled: 2025-05-07T20:32:43.8747266Z op = torch.compile(op) 2025-05-07T20:32:43.8747575Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8747870Z 2025-05-07T20:32:43.8748155Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.8748328Z 2025-05-07T20:32:43.8748439Z moe/activation_test.py:117: 2025-05-07T20:32:43.8748824Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8749186Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.8749514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.8750107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.8750695Z return fn(*args, **kwargs) 
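For context on the reference path: ref_fn in the listings above recomputes the SiLU-mul in fp32 and then calls triton_quantize_fp8_row. The following is a hedged, torch-only sketch of what a row-wise fp8 quantizer of that shape does; the 448.0 e4m3 maximum, the reciprocal-scale return value, the scale_ub-as-cap reading, and the helper name are assumptions for illustration, not FBGEMM's actual kernel semantics:

from typing import Optional, Tuple
import torch

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor]
) -> Tuple[torch.Tensor, torch.Tensor]:
    row_max = y.abs().amax(dim=1).clamp(min=1e-12)        # per-row absolute max
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)        # cap inferred from the argument name
    scale = 448.0 / row_max                               # map row max to the fp8 e4m3 max
    y_fp8 = (y * scale[:, None]).to(torch.float8_e4m3fn)  # needs an fp8-capable torch build
    return y_fp8, scale.reciprocal()

Note that returning the reciprocal scale matches the dequant step the test performs, y_fp8.to(torch.float32) * y_scale[:, None].
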
2025-05-07T20:32:43.8751394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.8752122Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.8752686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.8753403Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.8754132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.8754697Z kernel = self.compile( 2025-05-07T20:32:43.8755271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.8755960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.8756385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.8756635Z 2025-05-07T20:32:43.8756855Z self = 2025-05-07T20:32:43.8758000Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.8759447Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f9ee0>} 2025-05-07T20:32:43.8760983Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.8762056Z context = 2025-05-07T20:32:43.8762358Z 2025-05-07T20:32:43.8762539Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.8763090Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.8763581Z module_map=module_map) 2025-05-07T20:32:43.8763969Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.8764344Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.8764615Z E ^ 2025-05-07T20:32:43.8765155Z E ValueError("type fp8e4nv not supported in this architecture. 
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
[... same test body as above; fails identically at `y_fp8, y_scale = fn()` while compiling
_fbgemm_silu_mul_quant: triton.compiler.errors.CompilationError / ValueError("type fp8e4nv
not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") ...]
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [... same test body as above, continuing past the fused call ...]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
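The example above fails differently from the rest: the fused fn() call returned, and it is the eager reference path that died, because triton_quantize_fp8_row itself JIT-compiles a Triton kernel (_kernel_quantize_fp8_row) and hits the same architecture gate during autotuning. PyTorch itself can store and cast torch.float8_e4m3fn on this GPU (a plain dtype conversion needs no sm_89 hardware), so a Triton-free reference is feasible. The sketch below is a hypothetical stand-in for illustration, using the usual 448.0 e4m3 max-value convention and an assumed clamp semantics for scale_ub; it is not FBGEMM's triton_quantize_fp8_row:

from typing import Optional, Tuple
import torch

E4M3_MAX = 448.0  # largest finite value representable in torch.float8_e4m3fn

def quantize_fp8_row_ref(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Rowwise fp8 quantization in plain PyTorch (no Triton compile step).

    Returns (xq, scale) such that xq.to(torch.float32) * scale[:, None] ~= x,
    matching how the test dequantizes (y_fp8.to(torch.float32) * y_scale[:, None]).
    """
    row_max = x.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max used for scaling.
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
    xq = (x.to(torch.float32) / scale[:, None]).clamp(-E4M3_MAX, E4M3_MAX)
    return xq.to(torch.float8_e4m3fn), scale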
[... the next eight Hypothesis examples all fail at `y_fp8, y_scale = fn()` with the identical
CompilationError while compiling _fbgemm_silu_mul_quant; their test bodies and tracebacks are
the same as the first listing above and are condensed here ...]

Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=2048,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=, T=1,     D=5120, scale_ub=None,   contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=None,   contiguous=True,  compiled=True)
Trying example: test_silu_mul_quant(self=, T=4096,  D=5120, scale_ub=None,   contiguous=False, compiled=True)
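Every drawn example above dies in the same compile step regardless of T, D, scale_ub, contiguity, or torch.compile, so the Hypothesis sweep adds no new signal; the actionable fix is either at the job level (run the genai fp8 tests on an sm_89+ GPU) or at the test level (skip when the device cannot compile fp8e4nv). One way to express the latter, sketched as a suggestion rather than the repo's actual practice (the helper and marker names are hypothetical):

import pytest
import torch

def _can_compile_fp8e4nv() -> bool:
    # Same assumed capability gate as above: Triton's NVIDIA backend accepts
    # fp8e4nv only on compute capability >= (8, 9).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the test class (or individual tests), this turns dozens of
# identical CompilationError failures into a single clean skip on A10G-class
# runners such as linux.g5.4xlarge.
requires_fp8e4nv = pytest.mark.skipif(
    not _can_compile_fp8e4nv(),
    reason="Triton fp8e4nv needs compute capability >= 8.9; "
           "this GPU supports only fp8e4b15/fp8e5",
)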
Hypothesis then retries the test with further sampled parameter combinations. Every retry fails at the same point, with a traceback and CompilationError identical to the one above; only the example parameters differ:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)

The failures cover both compiled=True and compiled=False, so the error comes from compiling the Triton kernel itself, not from torch.compile. A reference sketch of the operation all of these examples exercise follows.
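Every retry above is compiling the same fused kernel. A plain-PyTorch sketch of the assumed semantics of silu_mul_quant (the name silu_mul_quant_ref and the rowwise-scaling details are assumptions for illustration, not FBGEMM's implementation):

import torch


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor]:
    # Assumed semantics: y = silu(x0) * x1, quantized rowwise to fp8 e4m3
    # (Triton's fp8e4nv) with one scale per row; the Triton kernel fuses
    # all of this into a single pass.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True)
    if scale_ub is not None:
        # Cap the scale, mirroring the scale_ub_tensor argument in the test.
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
    y_scale = (row_max / fp8_max).clamp(min=1e-12)
    y_fp8 = (y / y_scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

This runs in eager mode and is only meant to make the test's expected outputs (y_fp8, y_scale) concrete; it says nothing about the fused kernel's tiling or performance.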
2025-05-07T20:32:45.6593134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:45.6593858Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:45.6594422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:45.6595140Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:45.6595852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:45.6596414Z kernel = self.compile( 2025-05-07T20:32:45.6596979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:45.6597671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:45.6598091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:45.6598330Z 2025-05-07T20:32:45.6598552Z self = 2025-05-07T20:32:45.6599692Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:45.6601288Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ade2c5e0>} 2025-05-07T20:32:45.6602689Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:45.6603752Z context = 2025-05-07T20:32:45.6604052Z 2025-05-07T20:32:45.6604229Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:45.6604780Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:45.6605315Z module_map=module_map) 2025-05-07T20:32:45.6605704Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:45.6606072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:45.6606347Z E ^ 2025-05-07T20:32:45.6606838Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:45.6607347Z 
2025-05-07T20:32:45.6607818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.6608360Z 
2025-05-07T20:32:45.6608469Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:45.6640782Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.6641690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.8247980Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:45.8281224Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.8282134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:45.8282777Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:45.8315125Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:45.8316163Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.0006548Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.0046870Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.0047779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.0048438Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:46.0080589Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.0081496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1235452Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:46.1267175Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.1268084Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1268727Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:46.1299487Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.1300394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.1301039Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.2951590Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.2952593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.2953241Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:46.2985353Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.2986267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:46.2986923Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:46.4306400Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:46.4307309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
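Every sampled configuration fails at the same point: Triton refuses to lower fp8e4nv (FP8 E4M3) while compiling _fbgemm_silu_mul_quant, because this Triton build only supports that dtype on GPUs with compute capability 8.9 or newer, and the A10G on the linux.g5.4xlarge runner reports 8.6, which is why the error lists fp8e4b15 and fp8e5 as the only available fp8 dtypes. A capability guard would turn these failures into skips on such runners; the sketch below is illustrative only, and the helper name cuda_supports_fp8e4nv plus the skip wiring are assumptions, not code from the FBGEMM test suite:

    import unittest

    import torch


    def cuda_supports_fp8e4nv() -> bool:
        # fp8e4nv requires compute capability >= 8.9 (e.g. L4, H100); the
        # A10G on this runner reports (8, 6), so compilation fails there.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipIf(not cuda_supports_fp8e4nv(), "fp8e4nv needs SM 8.9+")
        def test_fp8_kernel(self) -> None:
            # A real fp8 test body (e.g. test_silu_mul_quant) would run here.
            pass

Because the decorator is evaluated once per test method, a Hypothesis-driven test would skip before drawing any examples, which also avoids paying kernel-compilation time on hardware that cannot run the kernel anyway.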
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.4306875Z 2025-05-07T20:32:46.4307309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.4307856Z 2025-05-07T20:32:46.4307964Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4308399Z self=, 2025-05-07T20:32:46.4308818Z T=16384, 2025-05-07T20:32:46.4309023Z D=5120, 2025-05-07T20:32:46.4309230Z scale_ub=None, 2025-05-07T20:32:46.4309457Z contiguous=False, 2025-05-07T20:32:46.4309690Z compiled=False, 2025-05-07T20:32:46.4309908Z ) 2025-05-07T20:32:46.4310243Z self = 2025-05-07T20:32:46.4310759Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.4311058Z 2025-05-07T20:32:46.4311143Z @given( 2025-05-07T20:32:46.4311389Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4311718Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4312043Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4312388Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4312729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4313028Z ) 2025-05-07T20:32:46.4313678Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4314142Z def test_silu_mul_quant( 2025-05-07T20:32:46.4314387Z self, 2025-05-07T20:32:46.4314594Z T: int, 2025-05-07T20:32:46.4314800Z D: int, 2025-05-07T20:32:46.4315025Z scale_ub: Optional[float], 2025-05-07T20:32:46.4315309Z contiguous: bool, 2025-05-07T20:32:46.4315560Z compiled: bool, 2025-05-07T20:32:46.4315787Z ) -> None: 2025-05-07T20:32:46.4316010Z torch.manual_seed(2025) 2025-05-07T20:32:46.4316262Z 2025-05-07T20:32:46.4316543Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4316970Z 2025-05-07T20:32:46.4317182Z x_sign = torch.sign(x) 2025-05-07T20:32:46.4317481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.4319579Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
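Note on the recurring CompilationError: Triton generally lowers the fp8e4nv (float8_e4m3fn) type only on GPUs with compute capability 8.9 or newer; on older parts it exposes just fp8e4b15 and fp8e5, which is exactly what the ValueError above reports. A minimal guard sketch, assuming a capability cutoff of (8, 9) (consistent with the error, though this log never states the device's compute capability):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv lowering generally requires sm_89+ (Ada/Hopper-class
    # GPUs); earlier architectures expose only fp8e4b15 and fp8e5.
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

A hypothetical use would be decorating the test shown in these traces with unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"), so the fp8 path is skipped cleanly instead of failing at kernel compilation.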
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4321671Z 2025-05-07T20:32:46.4321799Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.4322027Z 2025-05-07T20:32:46.4322198Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4322638Z self=, 2025-05-07T20:32:46.4323103Z T=4096, 2025-05-07T20:32:46.4323302Z D=7168, 2025-05-07T20:32:46.4323507Z scale_ub=1200.0, 2025-05-07T20:32:46.4323761Z contiguous=True, 2025-05-07T20:32:46.4323987Z compiled=True, 2025-05-07T20:32:46.4324203Z ) 2025-05-07T20:32:46.4324541Z self = 2025-05-07T20:32:46.4325057Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.4325337Z 2025-05-07T20:32:46.4325418Z @given( 2025-05-07T20:32:46.4325661Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4325990Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4326305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4326650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4326996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4327288Z ) 2025-05-07T20:32:46.4327663Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4328123Z def test_silu_mul_quant( 2025-05-07T20:32:46.4328373Z self, 2025-05-07T20:32:46.4328572Z T: int, 2025-05-07T20:32:46.4328778Z D: int, 2025-05-07T20:32:46.4329010Z scale_ub: Optional[float], 2025-05-07T20:32:46.4329288Z contiguous: bool, 2025-05-07T20:32:46.4329539Z compiled: bool, 2025-05-07T20:32:46.4329771Z ) -> None: 2025-05-07T20:32:46.4329991Z torch.manual_seed(2025) 2025-05-07T20:32:46.4330245Z 2025-05-07T20:32:46.4330529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4330879Z 2025-05-07T20:32:46.4331085Z x_sign = torch.sign(x) 2025-05-07T20:32:46.4331390Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.4333467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4335398Z 2025-05-07T20:32:46.4335525Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.4335749Z 2025-05-07T20:32:46.4335857Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.4336288Z self=, 2025-05-07T20:32:46.4336710Z T=16384, 2025-05-07T20:32:46.4336909Z D=7168, 2025-05-07T20:32:46.4337108Z scale_ub=None, 2025-05-07T20:32:46.4337336Z contiguous=False, 2025-05-07T20:32:46.4337566Z compiled=False, 2025-05-07T20:32:46.4337854Z ) 2025-05-07T20:32:46.4338191Z self = 2025-05-07T20:32:46.4338710Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.4339011Z 2025-05-07T20:32:46.4339091Z @given( 2025-05-07T20:32:46.4339337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.4339667Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.4339981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.4340326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.4340714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.4341005Z ) 2025-05-07T20:32:46.4341370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.4341832Z def test_silu_mul_quant( 2025-05-07T20:32:46.4342157Z self, 2025-05-07T20:32:46.4342365Z T: int, 2025-05-07T20:32:46.4342575Z D: int, 2025-05-07T20:32:46.4342833Z scale_ub: Optional[float], 2025-05-07T20:32:46.4343121Z contiguous: bool, 2025-05-07T20:32:46.4343378Z compiled: bool, 2025-05-07T20:32:46.4343605Z ) -> None: 2025-05-07T20:32:46.4343832Z torch.manual_seed(2025) 2025-05-07T20:32:46.4344083Z 2025-05-07T20:32:46.4344365Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.4346484Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.4348425Z 2025-05-07T20:32:46.4348550Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.5584990Z 2025-05-07T20:32:46.5585364Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5586081Z self=, 2025-05-07T20:32:46.5586667Z T=2048, 2025-05-07T20:32:46.5586936Z D=7168, 2025-05-07T20:32:46.5587163Z scale_ub=1200.0, 2025-05-07T20:32:46.5587404Z contiguous=True, 2025-05-07T20:32:46.5587635Z compiled=True, 2025-05-07T20:32:46.5587852Z ) 2025-05-07T20:32:46.5588205Z self = 2025-05-07T20:32:46.5588725Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:46.5589016Z 2025-05-07T20:32:46.5589099Z @given( 2025-05-07T20:32:46.5589344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5589683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5590017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5590372Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5590723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5591020Z ) 2025-05-07T20:32:46.5591392Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5591861Z def test_silu_mul_quant( 2025-05-07T20:32:46.5592113Z self, 2025-05-07T20:32:46.5592324Z T: int, 2025-05-07T20:32:46.5592536Z D: int, 2025-05-07T20:32:46.5592770Z scale_ub: Optional[float], 2025-05-07T20:32:46.5593060Z contiguous: bool, 2025-05-07T20:32:46.5593318Z compiled: bool, 2025-05-07T20:32:46.5593560Z ) -> None: 2025-05-07T20:32:46.5593788Z torch.manual_seed(2025) 2025-05-07T20:32:46.5594048Z 2025-05-07T20:32:46.5594339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5594702Z 2025-05-07T20:32:46.5595190Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5595512Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5597594Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.5599674Z 2025-05-07T20:32:46.5599799Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:46.5600030Z 2025-05-07T20:32:46.5600271Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5600794Z self=, 2025-05-07T20:32:46.5601285Z T=2048, 2025-05-07T20:32:46.5601485Z D=7168, 2025-05-07T20:32:46.5601692Z scale_ub=None, 2025-05-07T20:32:46.5601928Z contiguous=True, 2025-05-07T20:32:46.5602161Z compiled=False, 2025-05-07T20:32:46.5602381Z ) 2025-05-07T20:32:46.5602719Z self = 2025-05-07T20:32:46.5603234Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.5603521Z 2025-05-07T20:32:46.5603604Z @given( 2025-05-07T20:32:46.5603884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5604215Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5604543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5604887Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5605232Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5605536Z ) 2025-05-07T20:32:46.5605908Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5606364Z def test_silu_mul_quant( 2025-05-07T20:32:46.5606620Z self, 2025-05-07T20:32:46.5606823Z T: int, 2025-05-07T20:32:46.5607025Z D: int, 2025-05-07T20:32:46.5607256Z scale_ub: Optional[float], 2025-05-07T20:32:46.5607545Z contiguous: bool, 2025-05-07T20:32:46.5607792Z compiled: bool, 2025-05-07T20:32:46.5608024Z ) -> None: 2025-05-07T20:32:46.5608251Z torch.manual_seed(2025) 2025-05-07T20:32:46.5608499Z 2025-05-07T20:32:46.5608793Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5609151Z 2025-05-07T20:32:46.5609350Z > x_sign = torch.sign(x) 2025-05-07T20:32:46.5611360Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
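Note on the interleaved OutOfMemoryError entries: the free-memory figures shrink as Hypothesis retries examples, which suggests allocations surviving from earlier failed examples rather than any single oversized tensor; the largest request in these traces is 448 MiB against a 22 GiB device. Two mitigations, sketched under the assumption that the test process can be configured before CUDA initializes:

import os
# The allocator hint from the error message; it only takes effect if set
# before the process makes its first CUDA allocation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import gc
import torch

def release_cuda_memory() -> None:
    # Hypothetical per-example teardown: drop dead Python references, then
    # return cached blocks to the driver so the next example starts clean.
    gc.collect()
    torch.cuda.empty_cache()

Calling release_cuda_memory() between examples (for instance from a tearDown hook) trades some allocator reuse for headroom; it does not address the fp8 compilation failures above.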
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.5613288Z 2025-05-07T20:32:46.5613710Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:46.5613942Z 2025-05-07T20:32:46.5614051Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5614488Z self=, 2025-05-07T20:32:46.5614906Z T=1, 2025-05-07T20:32:46.5615100Z D=7168, 2025-05-07T20:32:46.5615305Z scale_ub=1200.0, 2025-05-07T20:32:46.5615534Z contiguous=True, 2025-05-07T20:32:46.5615771Z compiled=False, 2025-05-07T20:32:46.5615991Z ) 2025-05-07T20:32:46.5616400Z self = 2025-05-07T20:32:46.5616917Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.5617205Z 2025-05-07T20:32:46.5617286Z @given( 2025-05-07T20:32:46.5617530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5617854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5618183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5618533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5618872Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5619237Z ) 2025-05-07T20:32:46.5619606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5620062Z def test_silu_mul_quant( 2025-05-07T20:32:46.5620317Z self, 2025-05-07T20:32:46.5620536Z T: int, 2025-05-07T20:32:46.5620744Z D: int, 2025-05-07T20:32:46.5621036Z scale_ub: Optional[float], 2025-05-07T20:32:46.5621327Z contiguous: bool, 2025-05-07T20:32:46.5621637Z compiled: bool, 2025-05-07T20:32:46.5621947Z ) -> None: 2025-05-07T20:32:46.5629967Z torch.manual_seed(2025) 2025-05-07T20:32:46.5630270Z 2025-05-07T20:32:46.5630561Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5630922Z 2025-05-07T20:32:46.5631129Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5631432Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5631761Z x = x_sign * x_clamp 2025-05-07T20:32:46.5632016Z x0 = x[:, :D] 2025-05-07T20:32:46.5632244Z x1 = x[:, D:] 2025-05-07T20:32:46.5632462Z 2025-05-07T20:32:46.5632659Z if contiguous: 2025-05-07T20:32:46.5632894Z x0 = x0.contiguous() 2025-05-07T20:32:46.5633166Z x1 = x1.contiguous() 2025-05-07T20:32:46.5633421Z 2025-05-07T20:32:46.5633617Z if scale_ub is not None: 2025-05-07T20:32:46.5633908Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.5634268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.5634587Z ) 2025-05-07T20:32:46.5634796Z else: 2025-05-07T20:32:46.5635017Z scale_ub_tensor = None 2025-05-07T20:32:46.5635283Z 2025-05-07T20:32:46.5635523Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.5635859Z op = silu_mul_quant 2025-05-07T20:32:46.5636123Z if compiled: 2025-05-07T20:32:46.5636379Z op = torch.compile(op) 2025-05-07T20:32:46.5636692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5636987Z 2025-05-07T20:32:46.5637185Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.5637366Z 2025-05-07T20:32:46.5637470Z moe/activation_test.py:117: 2025-05-07T20:32:46.5637783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5638131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.5638434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5639164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.5639889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.5640541Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.5641260Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.5641954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.5642504Z kernel = self.compile( 2025-05-07T20:32:46.5643071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.5643759Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.5644264Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5644504Z 2025-05-07T20:32:46.5644723Z self = 2025-05-07T20:32:46.5645848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.5647285Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb462a0>} 2025-05-07T20:32:46.5648718Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.5649782Z context = 2025-05-07T20:32:46.5650122Z 2025-05-07T20:32:46.5650340Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.5650892Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.5651387Z module_map=module_map) 2025-05-07T20:32:46.5651759Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.5652134Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.5652404Z E ^ 2025-05-07T20:32:46.5652888Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.5653358Z 2025-05-07T20:32:46.5653789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.5654327Z 2025-05-07T20:32:46.5654439Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.5654877Z self=, 2025-05-07T20:32:46.5655298Z T=128, 2025-05-07T20:32:46.5655491Z D=5120, 2025-05-07T20:32:46.5655693Z scale_ub=None, 2025-05-07T20:32:46.5655916Z contiguous=True, 2025-05-07T20:32:46.5656141Z compiled=False, 2025-05-07T20:32:46.5656358Z ) 2025-05-07T20:32:46.5656692Z self = 2025-05-07T20:32:46.5657198Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.5657484Z 2025-05-07T20:32:46.5657564Z @given( 2025-05-07T20:32:46.5657801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.5658122Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.5658442Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.5658784Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.5659130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.5659429Z ) 2025-05-07T20:32:46.5659796Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.5660260Z def test_silu_mul_quant( 2025-05-07T20:32:46.5660505Z self, 2025-05-07T20:32:46.5660711Z T: int, 2025-05-07T20:32:46.5660918Z D: int, 2025-05-07T20:32:46.5661140Z scale_ub: Optional[float], 2025-05-07T20:32:46.5661425Z contiguous: bool, 2025-05-07T20:32:46.5661677Z compiled: bool, 2025-05-07T20:32:46.5661903Z ) -> None: 2025-05-07T20:32:46.5662127Z torch.manual_seed(2025) 2025-05-07T20:32:46.5662380Z 2025-05-07T20:32:46.5662658Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.5663020Z 2025-05-07T20:32:46.5663227Z x_sign = torch.sign(x) 2025-05-07T20:32:46.5663527Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.5663850Z x = x_sign * x_clamp 2025-05-07T20:32:46.5664102Z x0 = x[:, :D] 2025-05-07T20:32:46.5664334Z x1 = x[:, D:] 2025-05-07T20:32:46.5664543Z 2025-05-07T20:32:46.5664792Z if contiguous: 2025-05-07T20:32:46.5665038Z x0 = x0.contiguous() 2025-05-07T20:32:46.5665302Z x1 = x1.contiguous() 2025-05-07T20:32:46.5665550Z 2025-05-07T20:32:46.5665754Z if scale_ub is not None: 2025-05-07T20:32:46.5666032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.5666387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.5666709Z ) 2025-05-07T20:32:46.5666904Z else: 2025-05-07T20:32:46.5667125Z scale_ub_tensor = None 2025-05-07T20:32:46.5667440Z 2025-05-07T20:32:46.5667676Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.5668007Z op = silu_mul_quant 2025-05-07T20:32:46.5668267Z if compiled: 2025-05-07T20:32:46.5668514Z op = torch.compile(op) 2025-05-07T20:32:46.5668824Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5669149Z 2025-05-07T20:32:46.5669346Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.5669521Z 2025-05-07T20:32:46.5669662Z moe/activation_test.py:117: 2025-05-07T20:32:46.5669972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5670320Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.5670605Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.5671317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.5672028Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.5672582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.5673291Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.5673979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.5674540Z kernel = self.compile( 2025-05-07T20:32:46.5675097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.5675779Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.5676194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.5676429Z 2025-05-07T20:32:46.5676647Z self = 2025-05-07T20:32:46.5677763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.5679191Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb471a0>} 2025-05-07T20:32:46.5680671Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.5681738Z context = 2025-05-07T20:32:46.5682036Z 2025-05-07T20:32:46.5682209Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.5682753Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.5683241Z module_map=module_map) 2025-05-07T20:32:46.5683622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.5683983Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.5684256Z E ^ 2025-05-07T20:32:46.5684736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.5685207Z 2025-05-07T20:32:46.5685697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6810359Z 2025-05-07T20:32:46.6810628Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6811310Z self=, 2025-05-07T20:32:46.6811901Z T=128, 2025-05-07T20:32:46.6812129Z D=7168, 2025-05-07T20:32:46.6812339Z scale_ub=None, 2025-05-07T20:32:46.6812569Z contiguous=True, 2025-05-07T20:32:46.6812803Z compiled=False, 2025-05-07T20:32:46.6813026Z ) 2025-05-07T20:32:46.6813777Z self = 2025-05-07T20:32:46.6814290Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.6814580Z 2025-05-07T20:32:46.6814667Z @given( 2025-05-07T20:32:46.6814921Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6815345Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6815752Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6816111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6816462Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6816762Z ) 2025-05-07T20:32:46.6817132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6817626Z def test_silu_mul_quant( 2025-05-07T20:32:46.6817881Z self, 2025-05-07T20:32:46.6818088Z T: int, 2025-05-07T20:32:46.6818292Z D: int, 2025-05-07T20:32:46.6818526Z scale_ub: Optional[float], 2025-05-07T20:32:46.6818845Z contiguous: bool, 2025-05-07T20:32:46.6819123Z compiled: bool, 2025-05-07T20:32:46.6819363Z ) -> None: 2025-05-07T20:32:46.6819593Z torch.manual_seed(2025) 2025-05-07T20:32:46.6819843Z 2025-05-07T20:32:46.6820137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6820503Z 2025-05-07T20:32:46.6820716Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6821030Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6821374Z x = x_sign * x_clamp 2025-05-07T20:32:46.6821631Z x0 = x[:, :D] 2025-05-07T20:32:46.6821864Z x1 = x[:, D:] 2025-05-07T20:32:46.6822082Z 2025-05-07T20:32:46.6822284Z if contiguous: 2025-05-07T20:32:46.6822534Z x0 = x0.contiguous() 2025-05-07T20:32:46.6822807Z x1 = x1.contiguous() 2025-05-07T20:32:46.6823065Z 2025-05-07T20:32:46.6823272Z if scale_ub is not None: 2025-05-07T20:32:46.6823563Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6823922Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6824253Z ) 2025-05-07T20:32:46.6824458Z else: 2025-05-07T20:32:46.6824686Z scale_ub_tensor = None 2025-05-07T20:32:46.6824961Z 2025-05-07T20:32:46.6825205Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6825543Z op = silu_mul_quant 2025-05-07T20:32:46.6825814Z if compiled: 2025-05-07T20:32:46.6826080Z op = torch.compile(op) 2025-05-07T20:32:46.6826391Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6826684Z 2025-05-07T20:32:46.6826890Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.6827063Z 2025-05-07T20:32:46.6827169Z moe/activation_test.py:117: 2025-05-07T20:32:46.6827484Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6827843Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.6828136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6828859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.6829631Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.6830284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6830998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6831697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6832256Z kernel = self.compile( 2025-05-07T20:32:46.6832819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6833507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6834022Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6834264Z 2025-05-07T20:32:46.6834487Z self = 2025-05-07T20:32:46.6835690Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6837176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca58040>} 2025-05-07T20:32:46.6838572Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6839636Z context = 2025-05-07T20:32:46.6839942Z 2025-05-07T20:32:46.6840191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6840739Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6841232Z module_map=module_map) 2025-05-07T20:32:46.6841629Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6842003Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.6842280Z E ^ 2025-05-07T20:32:46.6842769Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6843239Z 2025-05-07T20:32:46.6843682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.6844218Z 2025-05-07T20:32:46.6844331Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6844776Z self=, 2025-05-07T20:32:46.6845202Z T=2048, 2025-05-07T20:32:46.6845398Z D=7168, 2025-05-07T20:32:46.6845607Z scale_ub=1200.0, 2025-05-07T20:32:46.6845850Z contiguous=True, 2025-05-07T20:32:46.6846086Z compiled=False, 2025-05-07T20:32:46.6846314Z ) 2025-05-07T20:32:46.6846658Z self = 2025-05-07T20:32:46.6847191Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.6847476Z 2025-05-07T20:32:46.6847557Z @given( 2025-05-07T20:32:46.6847809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6848143Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6848467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6848822Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6849184Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6849493Z ) 2025-05-07T20:32:46.6849868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6850339Z def test_silu_mul_quant( 2025-05-07T20:32:46.6850596Z self, 2025-05-07T20:32:46.6850797Z T: int, 2025-05-07T20:32:46.6851010Z D: int, 2025-05-07T20:32:46.6851245Z scale_ub: Optional[float], 2025-05-07T20:32:46.6851576Z contiguous: bool, 2025-05-07T20:32:46.6851843Z compiled: bool, 2025-05-07T20:32:46.6852080Z ) -> None: 2025-05-07T20:32:46.6852306Z torch.manual_seed(2025) 2025-05-07T20:32:46.6852564Z 2025-05-07T20:32:46.6852856Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6855001Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.6857004Z 2025-05-07T20:32:46.6857131Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.6857363Z 2025-05-07T20:32:46.6857510Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.6857953Z self=, 2025-05-07T20:32:46.6858376Z T=1, 2025-05-07T20:32:46.6858568Z D=5120, 2025-05-07T20:32:46.6858775Z scale_ub=1200.0, 2025-05-07T20:32:46.6859053Z contiguous=True, 2025-05-07T20:32:46.6859289Z compiled=False, 2025-05-07T20:32:46.6859508Z ) 2025-05-07T20:32:46.6859843Z self = 2025-05-07T20:32:46.6860354Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.6860636Z 2025-05-07T20:32:46.6860717Z @given( 2025-05-07T20:32:46.6860966Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.6861305Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.6861625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.6861988Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.6862342Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.6862641Z ) 2025-05-07T20:32:46.6863012Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.6863488Z def test_silu_mul_quant( 2025-05-07T20:32:46.6863742Z self, 2025-05-07T20:32:46.6863953Z T: int, 2025-05-07T20:32:46.6864172Z D: int, 2025-05-07T20:32:46.6864401Z scale_ub: Optional[float], 2025-05-07T20:32:46.6864696Z contiguous: bool, 2025-05-07T20:32:46.6864963Z compiled: bool, 2025-05-07T20:32:46.6865195Z ) -> None: 2025-05-07T20:32:46.6865425Z torch.manual_seed(2025) 2025-05-07T20:32:46.6865689Z 2025-05-07T20:32:46.6865972Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.6866337Z 2025-05-07T20:32:46.6866550Z x_sign = torch.sign(x) 2025-05-07T20:32:46.6866865Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:46.6867195Z x = x_sign * x_clamp 2025-05-07T20:32:46.6867462Z x0 = x[:, :D] 2025-05-07T20:32:46.6867694Z x1 = x[:, D:] 2025-05-07T20:32:46.6867910Z 2025-05-07T20:32:46.6868109Z if contiguous: 2025-05-07T20:32:46.6868401Z x0 = x0.contiguous() 2025-05-07T20:32:46.6868735Z x1 = x1.contiguous() 2025-05-07T20:32:46.6869057Z 2025-05-07T20:32:46.6869315Z if scale_ub is not None: 2025-05-07T20:32:46.6869673Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:46.6870119Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:46.6870501Z ) 2025-05-07T20:32:46.6870705Z else: 2025-05-07T20:32:46.6870929Z scale_ub_tensor = None 2025-05-07T20:32:46.6871197Z 2025-05-07T20:32:46.6871439Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:46.6871777Z op = silu_mul_quant 2025-05-07T20:32:46.6872046Z if compiled: 2025-05-07T20:32:46.6872362Z op = torch.compile(op) 2025-05-07T20:32:46.6872681Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6872977Z 2025-05-07T20:32:46.6873186Z > y_fp8, y_scale = fn() 2025-05-07T20:32:46.6873362Z 2025-05-07T20:32:46.6873468Z moe/activation_test.py:117: 2025-05-07T20:32:46.6873784Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6874142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:46.6874440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:46.6875266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:46.6875990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:46.6876565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:46.6877325Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:46.6878069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:46.6878636Z kernel = self.compile( 2025-05-07T20:32:46.6879207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:46.6879902Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:46.6880482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:46.6880727Z 2025-05-07T20:32:46.6880955Z self = 2025-05-07T20:32:46.6882082Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:46.6883528Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca59580>} 2025-05-07T20:32:46.6884934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:46.6886007Z context = 2025-05-07T20:32:46.6886313Z 2025-05-07T20:32:46.6886497Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:46.6887052Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:46.6887552Z module_map=module_map) 2025-05-07T20:32:46.6887944Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:46.6888322Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:46.6888604Z E ^ 2025-05-07T20:32:46.6889110Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:46.6889583Z 2025-05-07T20:32:46.6890026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:46.7712183Z 2025-05-07T20:32:46.7712496Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7713211Z self=, 2025-05-07T20:32:46.7714136Z T=2048, 2025-05-07T20:32:46.7714440Z D=5120, 2025-05-07T20:32:46.7714656Z scale_ub=None, 2025-05-07T20:32:46.7714878Z contiguous=True, 2025-05-07T20:32:46.7715116Z compiled=False, 2025-05-07T20:32:46.7715336Z ) 2025-05-07T20:32:46.7715667Z self = 2025-05-07T20:32:46.7716193Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7716721Z 2025-05-07T20:32:46.7716809Z @given( 2025-05-07T20:32:46.7717059Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7717387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7717715Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7718067Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7718413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7718721Z ) 2025-05-07T20:32:46.7719089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7719635Z def test_silu_mul_quant( 2025-05-07T20:32:46.7719885Z self, 2025-05-07T20:32:46.7720232Z T: int, 2025-05-07T20:32:46.7720447Z D: int, 2025-05-07T20:32:46.7720674Z scale_ub: Optional[float], 2025-05-07T20:32:46.7720962Z contiguous: bool, 2025-05-07T20:32:46.7721309Z compiled: bool, 2025-05-07T20:32:46.7721543Z ) -> None: 2025-05-07T20:32:46.7721776Z torch.manual_seed(2025) 2025-05-07T20:32:46.7722095Z 2025-05-07T20:32:46.7722397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7722760Z 2025-05-07T20:32:46.7722970Z > x_sign = torch.sign(x) 2025-05-07T20:32:46.7724999Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7726950Z 2025-05-07T20:32:46.7727081Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:46.7727314Z 2025-05-07T20:32:46.7727427Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7727986Z self=, 2025-05-07T20:32:46.7728499Z T=16384, 2025-05-07T20:32:46.7728760Z D=5120, 2025-05-07T20:32:46.7729236Z scale_ub=None, 2025-05-07T20:32:46.7737100Z contiguous=True, 2025-05-07T20:32:46.7737351Z compiled=False, 2025-05-07T20:32:46.7737576Z ) 2025-05-07T20:32:46.7737917Z self = 2025-05-07T20:32:46.7738461Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7738801Z 2025-05-07T20:32:46.7738903Z @given( 2025-05-07T20:32:46.7739153Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7739483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7739817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7740174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7740525Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7740834Z ) 2025-05-07T20:32:46.7741213Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7741686Z def test_silu_mul_quant( 2025-05-07T20:32:46.7741943Z self, 2025-05-07T20:32:46.7742156Z T: int, 2025-05-07T20:32:46.7742369Z D: int, 2025-05-07T20:32:46.7742600Z scale_ub: Optional[float], 2025-05-07T20:32:46.7742894Z contiguous: bool, 2025-05-07T20:32:46.7743158Z compiled: bool, 2025-05-07T20:32:46.7743396Z ) -> None: 2025-05-07T20:32:46.7743629Z torch.manual_seed(2025) 2025-05-07T20:32:46.7743890Z 2025-05-07T20:32:46.7744175Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7746417Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7748362Z 2025-05-07T20:32:46.7748488Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7748717Z 2025-05-07T20:32:46.7748829Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7749313Z self=, 2025-05-07T20:32:46.7749733Z T=4096, 2025-05-07T20:32:46.7749933Z D=5120, 2025-05-07T20:32:46.7750138Z scale_ub=None, 2025-05-07T20:32:46.7750362Z contiguous=True, 2025-05-07T20:32:46.7750603Z compiled=False, 2025-05-07T20:32:46.7750865Z ) 2025-05-07T20:32:46.7751196Z self = 2025-05-07T20:32:46.7751758Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:46.7752042Z 2025-05-07T20:32:46.7752131Z @given( 2025-05-07T20:32:46.7752369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7752704Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7753030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7753379Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7753722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7754030Z ) 2025-05-07T20:32:46.7754399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7754861Z def test_silu_mul_quant( 2025-05-07T20:32:46.7755120Z self, 2025-05-07T20:32:46.7755332Z T: int, 2025-05-07T20:32:46.7755541Z D: int, 2025-05-07T20:32:46.7755776Z scale_ub: Optional[float], 2025-05-07T20:32:46.7756070Z contiguous: bool, 2025-05-07T20:32:46.7756326Z compiled: bool, 2025-05-07T20:32:46.7756562Z ) -> None: 2025-05-07T20:32:46.7756792Z torch.manual_seed(2025) 2025-05-07T20:32:46.7757045Z 2025-05-07T20:32:46.7757340Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7759460Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7761478Z 2025-05-07T20:32:46.7761609Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7761836Z 2025-05-07T20:32:46.7761954Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7762385Z self=, 2025-05-07T20:32:46.7762814Z T=2048, 2025-05-07T20:32:46.7763014Z D=5120, 2025-05-07T20:32:46.7763213Z scale_ub=None, 2025-05-07T20:32:46.7763443Z contiguous=False, 2025-05-07T20:32:46.7763687Z compiled=False, 2025-05-07T20:32:46.7763899Z ) 2025-05-07T20:32:46.7764235Z self = 2025-05-07T20:32:46.7764761Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:46.7765049Z 2025-05-07T20:32:46.7765138Z @given( 2025-05-07T20:32:46.7765378Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7765709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7766038Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7766431Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7766782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7767088Z ) 2025-05-07T20:32:46.7767451Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7767922Z def test_silu_mul_quant( 2025-05-07T20:32:46.7768181Z self, 2025-05-07T20:32:46.7768392Z T: int, 2025-05-07T20:32:46.7768596Z D: int, 2025-05-07T20:32:46.7768834Z scale_ub: Optional[float], 2025-05-07T20:32:46.7769130Z contiguous: bool, 2025-05-07T20:32:46.7769423Z compiled: bool, 2025-05-07T20:32:46.7769659Z ) -> None: 2025-05-07T20:32:46.7769893Z torch.manual_seed(2025) 2025-05-07T20:32:46.7770145Z 2025-05-07T20:32:46.7770433Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7772630Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7774545Z 2025-05-07T20:32:46.7774676Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7774900Z 2025-05-07T20:32:46.7775017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7775447Z self=, 2025-05-07T20:32:46.7775868Z T=4096, 2025-05-07T20:32:46.7776074Z D=7168, 2025-05-07T20:32:46.7776273Z scale_ub=None, 2025-05-07T20:32:46.7776505Z contiguous=True, 2025-05-07T20:32:46.7776745Z compiled=True, 2025-05-07T20:32:46.7776962Z ) 2025-05-07T20:32:46.7777302Z self = 2025-05-07T20:32:46.7777827Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:46.7778107Z 2025-05-07T20:32:46.7778188Z @given( 2025-05-07T20:32:46.7778437Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.7778794Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.7779151Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.7779496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.7779858Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.7780163Z ) 2025-05-07T20:32:46.7780527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.7780995Z def test_silu_mul_quant( 2025-05-07T20:32:46.7781256Z self, 2025-05-07T20:32:46.7781459Z T: int, 2025-05-07T20:32:46.7781674Z D: int, 2025-05-07T20:32:46.7781911Z scale_ub: Optional[float], 2025-05-07T20:32:46.7782197Z contiguous: bool, 2025-05-07T20:32:46.7782456Z compiled: bool, 2025-05-07T20:32:46.7782696Z ) -> None: 2025-05-07T20:32:46.7782923Z torch.manual_seed(2025) 2025-05-07T20:32:46.7783184Z 2025-05-07T20:32:46.7783476Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.7785603Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.7787573Z 2025-05-07T20:32:46.7787709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.7787932Z 2025-05-07T20:32:46.7788041Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.7788476Z self=, 2025-05-07T20:32:46.7788923Z T=2048, 2025-05-07T20:32:46.7789139Z D=5120, 2025-05-07T20:32:46.7789345Z scale_ub=1200.0, 2025-05-07T20:32:46.7789586Z contiguous=False, 2025-05-07T20:32:46.7789830Z compiled=False, 2025-05-07T20:32:46.8331976Z ) 2025-05-07T20:32:46.8332739Z self = 2025-05-07T20:32:46.8333479Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:46.8333879Z 2025-05-07T20:32:46.8333993Z @given( 2025-05-07T20:32:46.8334300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.8334845Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.8335339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.8335688Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.8336044Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.8336353Z ) 2025-05-07T20:32:46.8336721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.8337194Z def test_silu_mul_quant( 2025-05-07T20:32:46.8337461Z self, 2025-05-07T20:32:46.8337673Z T: int, 2025-05-07T20:32:46.8337880Z D: int, 2025-05-07T20:32:46.8338121Z scale_ub: Optional[float], 2025-05-07T20:32:46.8338413Z contiguous: bool, 2025-05-07T20:32:46.8338679Z compiled: bool, 2025-05-07T20:32:46.8338958Z ) -> None: 2025-05-07T20:32:46.8339189Z torch.manual_seed(2025) 2025-05-07T20:32:46.8339451Z 2025-05-07T20:32:46.8339741Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.8341871Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:46.8343802Z 2025-05-07T20:32:46.8343928Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:46.8344156Z 2025-05-07T20:32:46.8344266Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:46.8344708Z self=, 2025-05-07T20:32:46.8345134Z T=4096, 2025-05-07T20:32:46.8345333Z D=7168, 2025-05-07T20:32:46.8345543Z scale_ub=1200.0, 2025-05-07T20:32:46.8345785Z contiguous=True, 2025-05-07T20:32:46.8346086Z compiled=False, 2025-05-07T20:32:46.8346401Z ) 2025-05-07T20:32:46.8346860Z self = 2025-05-07T20:32:46.8347488Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:46.8347780Z 2025-05-07T20:32:46.8347861Z @given( 2025-05-07T20:32:46.8348105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:46.8348431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:46.8348756Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:46.8349106Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:46.8349454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:46.8349749Z ) 2025-05-07T20:32:46.8350114Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:46.8350579Z def test_silu_mul_quant( 2025-05-07T20:32:46.8350831Z self, 2025-05-07T20:32:46.8351133Z T: int, 2025-05-07T20:32:46.8351350Z D: int, 2025-05-07T20:32:46.8351580Z scale_ub: Optional[float], 2025-05-07T20:32:46.8351873Z contiguous: bool, 2025-05-07T20:32:46.8352126Z compiled: bool, 2025-05-07T20:32:46.8352356Z ) -> None: 2025-05-07T20:32:46.8352585Z torch.manual_seed(2025) 2025-05-07T20:32:46.8352844Z 2025-05-07T20:32:46.8353139Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:46.8355305Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <ActivationTests ...>
T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Three more examples fail identically at moe/activation_test.py:92 (the torch.randn call), differing only in the size requested:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. (26.44 MiB free of 22.07 GiB)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (26.44 MiB free of 22.07 GiB)

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (26.44 MiB free of 22.07 GiB)

moe/activation_test.py:92: OutOfMemoryError
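Each of these messages repeats the same allocator hint, which can be applied without code changes. A minimal sketch of enabling it from Python, assuming it runs in a fresh process before any CUDA context exists (the variable is read when the caching allocator initializes, so it must be set before the first allocation):

# Sketch: enable expandable segments before torch touches the GPU.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the variable so the allocator picks it up

if torch.cuda.is_available():
    free_b, total_b = torch.cuda.mem_get_info()  # free/total bytes on the current device
    print(f"free={free_b / 2**20:.2f} MiB of {total_b / 2**30:.2f} GiB")

Exporting the same variable in the workflow environment before pytest starts achieves the same effect without touching the test code.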
Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <ActivationTests ...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <ASTSource ...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f07ac7b11c0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
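The ValueError is Triton rejecting fp8e4nv (FP8 E4M3) on this GPU. As a sketch, the condition could be detected up front and used to skip the FP8 paths instead of failing inside kernel compilation; the (8, 9) cutoff is an assumption inferred from this log (an SM 8.6 A10G rejecting fp8e4nv), so verify it against the Triton release in use:

# Sketch: gate FP8-E4M3 (fp8e4nv) test paths on compute capability.
# Assumption: fp8e4nv needs SM >= 8.9 (Ada/Hopper-class parts).
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test module could consult this in setUp and call self.skipTest(...) rather than letting the Triton compile raise.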
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. (26.44 MiB free of 22.07 GiB)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Here fn() runs with compiled=True, so the call passes through torch._dynamo (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn) before reaching silu_mul_quant at activation.py:80, where the same kernel compilation fails:
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
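A small sketch of the pattern fn() uses, for context: torch.compile only wraps the callable, and the underlying Triton kernel is still compiled on first call, so hardware-support errors surface when op(...) runs, not at wrap time.

# Sketch of the test's fn() pattern; op stands in for silu_mul_quant.
import torch

def run_op(op, x0, x1, scale_ub_tensor, compiled: bool):
    if compiled:
        op = torch.compile(op)  # lazy: nothing is compiled yet
    return op(x0, x1, scale_ub_tensor)  # kernel compile + launch happen here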
Memory pressure now dominates even the small examples; three more fail while building their inputs, with only 4.44 MiB free:

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
moe/activation_test.py:92: OutOfMemoryError
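The requested sizes line up with the input shape: x is [T, 2*D] in bfloat16, i.e. T * 2D * 2 bytes. A quick sketch of that arithmetic for the shapes above (the smallest shapes are subject to allocator rounding, so they match less exactly):

# Sketch: expected size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16).
# bfloat16 is 2 bytes per element.
def x_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20

print(x_mib(16384, 7168))  # 448.0 MiB, as reported above
print(x_mib(4096, 7168))   # 112.0 MiB
print(x_mib(2048, 7168))   # 56.0 MiB
print(x_mib(2048, 5120))   # 40.0 MiB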
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |     ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free of 22.07 GiB)
    | Falsifying example: test_silu_mul_quant(
    |     self=<ActivationTests ...>,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |                              ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |                        ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |                        module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=<ActivationTests ...>,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
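The reproduce_failure lines above are the quickest way to replay a single failure locally. A sketch of applying the first one (version string and payload copied from sub-exception 1; Hypothesis requires the decorator to sit on the same test with its original strategies, and it should be removed again after debugging):

# Sketch: pin Hypothesis to falsifying example 1. The decorator stacks on
# top of the existing @given/@settings on test_silu_mul_quant.
from hypothesis import reproduce_failure

@reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
# @given(...) and @settings(...) unchanged from the test above
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...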
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=<ActivationTests ...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
For this example fn() itself returns, and the failure moves to the reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(the intermediate jit/autotuner frames are identical to sub-exception 4 above)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: the same ValueError, raised while compiling _fbgemm_silu_mul_quant
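ref_fn above delegates the row-wise quantization to triton_quantize_fp8_row, which is exactly where the unsupported-dtype error fires. For intuition about what that step computes, a hedged pure-PyTorch sketch (not FBGEMM's implementation; the E4M3 max of 448 and the epsilon are assumptions of this sketch, and torch.float8_e4m3fn needs a reasonably recent PyTorch):

import torch

FP8_MAX = 448.0  # finite max of float8_e4m3fn (assumption for this sketch)

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # Per-row absolute max, optionally capped by scale_ub as in the test.
    row_max = y.abs().amax(dim=1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    # One scale per row; the epsilon guards all-zero rows.
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] recovers y approximately, which is the check the test performs.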
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3296591Z 2025-05-07T20:32:47.3297033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3297571Z 2025-05-07T20:32:47.3297687Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3298124Z self=, 2025-05-07T20:32:47.3298558Z T=2048, 2025-05-07T20:32:47.3298763Z D=5120, 2025-05-07T20:32:47.3298974Z scale_ub=1200.0, 2025-05-07T20:32:47.3299217Z contiguous=True, 2025-05-07T20:32:47.3299464Z compiled=True, 2025-05-07T20:32:47.3299680Z ) 2025-05-07T20:32:47.3300019Z self = 2025-05-07T20:32:47.3300547Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3300833Z 2025-05-07T20:32:47.3300922Z @given( 2025-05-07T20:32:47.3301164Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3301501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3301832Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3302184Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3302538Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3302844Z ) 2025-05-07T20:32:47.3303211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3303680Z def test_silu_mul_quant( 2025-05-07T20:32:47.3303944Z self, 2025-05-07T20:32:47.3304154Z T: int, 2025-05-07T20:32:47.3304365Z D: int, 2025-05-07T20:32:47.3304605Z scale_ub: Optional[float], 2025-05-07T20:32:47.3304892Z contiguous: bool, 2025-05-07T20:32:47.3305153Z compiled: bool, 2025-05-07T20:32:47.3305397Z ) -> None: 2025-05-07T20:32:47.3305624Z torch.manual_seed(2025) 2025-05-07T20:32:47.3305885Z 2025-05-07T20:32:47.3306180Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3306542Z 2025-05-07T20:32:47.3306746Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3307065Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3307396Z x = x_sign * x_clamp 2025-05-07T20:32:47.3307649Z x0 = x[:, :D] 2025-05-07T20:32:47.3307883Z x1 = x[:, D:] 2025-05-07T20:32:47.3308107Z 2025-05-07T20:32:47.3308303Z if contiguous: 2025-05-07T20:32:47.3308552Z x0 = x0.contiguous() 2025-05-07T20:32:47.3308832Z x1 = x1.contiguous() 2025-05-07T20:32:47.3309132Z 2025-05-07T20:32:47.3309348Z if scale_ub is not None: 2025-05-07T20:32:47.3309647Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3310003Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3310335Z ) 2025-05-07T20:32:47.3310545Z else: 2025-05-07T20:32:47.3310767Z scale_ub_tensor = None 2025-05-07T20:32:47.3311037Z 2025-05-07T20:32:47.3311286Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3311624Z op = silu_mul_quant 2025-05-07T20:32:47.3311938Z if compiled: 2025-05-07T20:32:47.3312206Z op = torch.compile(op) 2025-05-07T20:32:47.3312527Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3312817Z 2025-05-07T20:32:47.3313029Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3313634Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3314143Z 2025-05-07T20:32:47.3314402Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3314838Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3315150Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3315488Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3315874Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3316208Z 2025-05-07T20:32:47.3316421Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.3316639Z 2025-05-07T20:32:47.3316746Z moe/activation_test.py:126: 2025-05-07T20:32:47.3317065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3317421Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3317772Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3318606Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3319449Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3320026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3320831Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3321561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3322317Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3323094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3323776Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3324414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3324959Z fn() 2025-05-07T20:32:47.3325498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3326111Z self.fn.run( 2025-05-07T20:32:47.3326598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3327160Z kernel = self.compile( 2025-05-07T20:32:47.3327738Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3328434Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3328884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3329157Z 2025-05-07T20:32:47.3329374Z self = 2025-05-07T20:32:47.3330585Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3332031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f089ab34720>} 2025-05-07T20:32:47.3333423Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3334492Z context = 2025-05-07T20:32:47.3334860Z 2025-05-07T20:32:47.3335038Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3335592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3336081Z module_map=module_map) 2025-05-07T20:32:47.3336520Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3336940Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3337231Z E ^ 2025-05-07T20:32:47.3337719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3338196Z 2025-05-07T20:32:47.3338634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3339173Z 2025-05-07T20:32:47.3339293Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3339734Z self=, 2025-05-07T20:32:47.3340162Z T=16384, 2025-05-07T20:32:47.3340368Z D=7168, 2025-05-07T20:32:47.3340579Z scale_ub=1200.0, 2025-05-07T20:32:47.3340813Z contiguous=False, 2025-05-07T20:32:47.3341058Z compiled=False, 2025-05-07T20:32:47.3341283Z ) 2025-05-07T20:32:47.3341618Z self = 2025-05-07T20:32:47.3342158Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3342460Z 2025-05-07T20:32:47.3342551Z @given( 2025-05-07T20:32:47.3342798Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3343136Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3343467Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3343816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3344169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3344479Z ) 2025-05-07T20:32:47.3344853Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3345317Z def test_silu_mul_quant( 2025-05-07T20:32:47.3345581Z self, 2025-05-07T20:32:47.3345795Z T: int, 2025-05-07T20:32:47.3345999Z D: int, 2025-05-07T20:32:47.3346240Z scale_ub: Optional[float], 2025-05-07T20:32:47.3346537Z contiguous: bool, 2025-05-07T20:32:47.3346793Z compiled: bool, 2025-05-07T20:32:47.3347032Z ) -> None: 2025-05-07T20:32:47.3347262Z torch.manual_seed(2025) 2025-05-07T20:32:47.3347512Z 2025-05-07T20:32:47.3347808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3348169Z 2025-05-07T20:32:47.3348371Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3348683Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3349012Z x = x_sign * x_clamp 2025-05-07T20:32:47.3349268Z x0 = x[:, :D] 2025-05-07T20:32:47.3349498Z x1 = x[:, D:] 2025-05-07T20:32:47.3349723Z 2025-05-07T20:32:47.3349922Z if contiguous: 2025-05-07T20:32:47.3350163Z x0 = x0.contiguous() 2025-05-07T20:32:47.3350440Z x1 = x1.contiguous() 2025-05-07T20:32:47.3350700Z 2025-05-07T20:32:47.3350902Z if scale_ub is not None: 2025-05-07T20:32:47.3351245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3351608Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3351933Z ) 2025-05-07T20:32:47.3352142Z else: 2025-05-07T20:32:47.3352366Z scale_ub_tensor = None 2025-05-07T20:32:47.3352628Z 2025-05-07T20:32:47.3352876Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3353211Z op = silu_mul_quant 2025-05-07T20:32:47.3353473Z if compiled: 2025-05-07T20:32:47.3353741Z op = torch.compile(op) 2025-05-07T20:32:47.3354059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3354391Z 2025-05-07T20:32:47.3354600Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3354779Z 2025-05-07T20:32:47.3354894Z moe/activation_test.py:117: 2025-05-07T20:32:47.3362587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3363035Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3363347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3364121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3364849Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3365420Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3366136Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3366842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3367406Z kernel = self.compile( 2025-05-07T20:32:47.3367982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3368667Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3369095Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3369342Z 2025-05-07T20:32:47.3369567Z self = 2025-05-07T20:32:47.3370689Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3372123Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0899a13880>} 2025-05-07T20:32:47.3373520Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3374586Z context = 2025-05-07T20:32:47.3374894Z 2025-05-07T20:32:47.3375083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3375631Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3376134Z module_map=module_map) 2025-05-07T20:32:47.3376524Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3376895Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3377175Z E ^ 2025-05-07T20:32:47.3377670Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3378583Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[condensed: the next nine Hypothesis examples re-printed the identical test body and traceback shown above, differing only in the drawn parameters, the failing line, and which Triton kernel failed to compile; each raised the identical triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:47.3379233Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row
2025-05-07T20:32:47.3404350Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3418417Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3431950Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3448834Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3462330Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> failed at moe/activation_test.py:117 (fn) compiling _fbgemm_silu_mul_quant
2025-05-07T20:32:47.3475670Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3492442Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
2025-05-07T20:32:47.3513093Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> fn() succeeded; failed at moe/activation_test.py:126 (ref_fn) compiling _kernel_quantize_fp8_row
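Why every compile fails: Triton lowers torch.float8_e4m3fn to its fp8e4nv type, and per the error text this runner's GPU architecture only exposes fp8e4b15 and fp8e5; fp8e4nv generally requires an NVIDIA part of compute capability 8.9 or newer (Ada/Hopper). A minimal guard sketch that would skip these tests on older hardware; the helper name and the 8.9 threshold are assumptions, not code from this repository or this log:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs compute capability >= 8.9;
    # older GPUs only expose fp8e4b15 / fp8e5, matching the ValueError above.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

A test class could then be decorated with unittest.skipIf(not supports_fp8e4nv(), "requires SM 8.9+"), so Hypothesis never reaches the Triton compile on unsupported runners.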
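For reference, the contract the test relies on -- triton_quantize_fp8_row(y, scale_ub_tensor) returns (y_fp8, y_scale) such that y is approximately y_fp8.to(torch.float32) * y_scale[:, None] -- can be sketched in plain PyTorch. Row-wise absmax scaling and the use of scale_ub as a cap on the per-row max are assumptions about fbgemm's kernel, not taken from it:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_sketch(
    x: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Hypothetical row-wise quantization into float8_e4m3fn.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = x.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap each row at scale_ub
    scale = row_max.clamp(min=1e-12) / fp8_max      # per-row dequantization scale
    x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return x_fp8, scale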
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3530443Z 2025-05-07T20:32:47.3530876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3530881Z 2025-05-07T20:32:47.3530990Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3531228Z self=, 2025-05-07T20:32:47.3531474Z T=4096, 2025-05-07T20:32:47.3531554Z D=5120, 2025-05-07T20:32:47.3531644Z scale_ub=None, 2025-05-07T20:32:47.3531732Z contiguous=True, 2025-05-07T20:32:47.3531817Z compiled=True, 2025-05-07T20:32:47.3531898Z ) 2025-05-07T20:32:47.3532129Z self = 2025-05-07T20:32:47.3532354Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.3532364Z 2025-05-07T20:32:47.3532483Z @given( 2025-05-07T20:32:47.3532612Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3532721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3532843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3532965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3533088Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3533166Z ) 2025-05-07T20:32:47.3533422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3533530Z def test_silu_mul_quant( 2025-05-07T20:32:47.3533610Z self, 2025-05-07T20:32:47.3533690Z T: int, 2025-05-07T20:32:47.3533773Z D: int, 2025-05-07T20:32:47.3533875Z scale_ub: Optional[float], 2025-05-07T20:32:47.3533973Z contiguous: bool, 2025-05-07T20:32:47.3534066Z compiled: bool, 2025-05-07T20:32:47.3534149Z ) -> None: 2025-05-07T20:32:47.3534253Z torch.manual_seed(2025) 2025-05-07T20:32:47.3534328Z 2025-05-07T20:32:47.3534505Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3534584Z 2025-05-07T20:32:47.3534680Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3534815Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3534912Z x = x_sign * x_clamp 2025-05-07T20:32:47.3534996Z x0 = x[:, :D] 2025-05-07T20:32:47.3535080Z x1 = x[:, D:] 2025-05-07T20:32:47.3535160Z 2025-05-07T20:32:47.3535252Z if contiguous: 2025-05-07T20:32:47.3535351Z x0 = x0.contiguous() 2025-05-07T20:32:47.3535445Z x1 = x1.contiguous() 2025-05-07T20:32:47.3535521Z 2025-05-07T20:32:47.3535621Z if scale_ub is not None: 2025-05-07T20:32:47.3535732Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3535875Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3535960Z ) 2025-05-07T20:32:47.3536042Z else: 2025-05-07T20:32:47.3536141Z scale_ub_tensor = None 2025-05-07T20:32:47.3536220Z 2025-05-07T20:32:47.3536354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3536449Z op = silu_mul_quant 2025-05-07T20:32:47.3536542Z if compiled: 2025-05-07T20:32:47.3536646Z op = torch.compile(op) 2025-05-07T20:32:47.3536758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3536835Z 2025-05-07T20:32:47.3536934Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3537063Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3537138Z 2025-05-07T20:32:47.3537280Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3537388Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3537494Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3537673Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3537828Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3537905Z 2025-05-07T20:32:47.3538010Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.3538020Z 2025-05-07T20:32:47.3538123Z moe/activation_test.py:126: 2025-05-07T20:32:47.3538261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3538372Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3538512Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3539134Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3539245Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3539619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3539937Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3540320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3540594Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3540991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3541168Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3541526Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3541612Z fn() 2025-05-07T20:32:47.3542028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3542116Z self.fn.run( 2025-05-07T20:32:47.3542474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3542574Z kernel = self.compile( 2025-05-07T20:32:47.3542974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3543159Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3543292Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3543300Z 2025-05-07T20:32:47.3543511Z self = 2025-05-07T20:32:47.3544316Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3544847Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f08986d2200>} 2025-05-07T20:32:47.3545624Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3545827Z context = 2025-05-07T20:32:47.3545832Z 2025-05-07T20:32:47.3546005Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3546283Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3546406Z module_map=module_map) 2025-05-07T20:32:47.3546581Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3546691Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3546771Z E ^ 2025-05-07T20:32:47.3547187Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3547194Z 2025-05-07T20:32:47.3547632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3547637Z 2025-05-07T20:32:47.3547746Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3547980Z self=, 2025-05-07T20:32:47.3548068Z T=16384, 2025-05-07T20:32:47.3548148Z D=5120, 2025-05-07T20:32:47.3548237Z scale_ub=None, 2025-05-07T20:32:47.3548325Z contiguous=True, 2025-05-07T20:32:47.3548454Z compiled=True, 2025-05-07T20:32:47.3548533Z ) 2025-05-07T20:32:47.3548762Z self = 2025-05-07T20:32:47.3548947Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.3548952Z 2025-05-07T20:32:47.3549076Z @given( 2025-05-07T20:32:47.3549204Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3549346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3549471Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3549593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3549717Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3549794Z ) 2025-05-07T20:32:47.3550051Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3550151Z def test_silu_mul_quant( 2025-05-07T20:32:47.3550230Z self, 2025-05-07T20:32:47.3550314Z T: int, 2025-05-07T20:32:47.3550399Z D: int, 2025-05-07T20:32:47.3550501Z scale_ub: Optional[float], 2025-05-07T20:32:47.3550593Z contiguous: bool, 2025-05-07T20:32:47.3550684Z compiled: bool, 2025-05-07T20:32:47.3550766Z ) -> None: 2025-05-07T20:32:47.3550864Z torch.manual_seed(2025) 2025-05-07T20:32:47.3550949Z 2025-05-07T20:32:47.3551130Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3551213Z 2025-05-07T20:32:47.3551307Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3551435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3551531Z x = x_sign * x_clamp 2025-05-07T20:32:47.3551612Z x0 = x[:, :D] 2025-05-07T20:32:47.3551695Z x1 = x[:, D:] 2025-05-07T20:32:47.3551772Z 2025-05-07T20:32:47.3551858Z if contiguous: 2025-05-07T20:32:47.3551954Z x0 = x0.contiguous() 2025-05-07T20:32:47.3552051Z x1 = x1.contiguous() 2025-05-07T20:32:47.3552130Z 2025-05-07T20:32:47.3552224Z if scale_ub is not None: 2025-05-07T20:32:47.3552339Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3552481Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3552562Z ) 2025-05-07T20:32:47.3552644Z else: 2025-05-07T20:32:47.3552743Z scale_ub_tensor = None 2025-05-07T20:32:47.3552824Z 2025-05-07T20:32:47.3552959Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3553054Z op = silu_mul_quant 2025-05-07T20:32:47.3553143Z if compiled: 2025-05-07T20:32:47.3553246Z op = torch.compile(op) 2025-05-07T20:32:47.3553356Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3553435Z 2025-05-07T20:32:47.3553531Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3553657Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3553734Z 2025-05-07T20:32:47.3553880Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3553988Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3554092Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3554225Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3554373Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3554495Z 2025-05-07T20:32:47.3554606Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:47.3554611Z 2025-05-07T20:32:47.3554714Z moe/activation_test.py:126: 2025-05-07T20:32:47.3554847Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3554958Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3555096Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3555672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3555823Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3556197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3556435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3556859Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3557165Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3557563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3557738Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3558098Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3558178Z fn() 2025-05-07T20:32:47.3558599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3558688Z self.fn.run( 2025-05-07T20:32:47.3559041Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3559140Z kernel = self.compile( 2025-05-07T20:32:47.3559542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3559726Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3559862Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3559866Z 2025-05-07T20:32:47.3560162Z self = 2025-05-07T20:32:47.3560963Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3561497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb34900>} 2025-05-07T20:32:47.3562276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3562479Z context = 2025-05-07T20:32:47.3562483Z 2025-05-07T20:32:47.3562656Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3562934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3563046Z module_map=module_map) 2025-05-07T20:32:47.3563217Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3563328Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3563407Z E ^ 2025-05-07T20:32:47.3563776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3563783Z 2025-05-07T20:32:47.3564289Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3564294Z 2025-05-07T20:32:47.3564404Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3564643Z self=, 2025-05-07T20:32:47.3564723Z T=1, 2025-05-07T20:32:47.3564803Z D=5120, 2025-05-07T20:32:47.3564894Z scale_ub=1200.0, 2025-05-07T20:32:47.3564981Z contiguous=True, 2025-05-07T20:32:47.3565067Z compiled=True, 2025-05-07T20:32:47.3565147Z ) 2025-05-07T20:32:47.3565375Z self = 2025-05-07T20:32:47.3565589Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3565597Z 2025-05-07T20:32:47.3565676Z @given( 2025-05-07T20:32:47.3565801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3565906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3566068Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3566228Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3566351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3566429Z ) 2025-05-07T20:32:47.3566686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3566788Z def test_silu_mul_quant( 2025-05-07T20:32:47.3566867Z self, 2025-05-07T20:32:47.3566947Z T: int, 2025-05-07T20:32:47.3567029Z D: int, 2025-05-07T20:32:47.3567172Z scale_ub: Optional[float], 2025-05-07T20:32:47.3567306Z contiguous: bool, 2025-05-07T20:32:47.3567432Z compiled: bool, 2025-05-07T20:32:47.3567547Z ) -> None: 2025-05-07T20:32:47.3567671Z torch.manual_seed(2025) 2025-05-07T20:32:47.3567745Z 2025-05-07T20:32:47.3567923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3568003Z 2025-05-07T20:32:47.3568098Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3568233Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3568323Z x = x_sign * x_clamp 2025-05-07T20:32:47.3568406Z x0 = x[:, :D] 2025-05-07T20:32:47.3568493Z x1 = x[:, D:] 2025-05-07T20:32:47.3568569Z 2025-05-07T20:32:47.3568656Z if contiguous: 2025-05-07T20:32:47.3568752Z x0 = x0.contiguous() 2025-05-07T20:32:47.3568843Z x1 = x1.contiguous() 2025-05-07T20:32:47.3568921Z 2025-05-07T20:32:47.3569015Z if scale_ub is not None: 2025-05-07T20:32:47.3569124Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3569268Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3569346Z ) 2025-05-07T20:32:47.3569422Z else: 2025-05-07T20:32:47.3569521Z scale_ub_tensor = None 2025-05-07T20:32:47.3569595Z 2025-05-07T20:32:47.3569728Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3569830Z op = silu_mul_quant 2025-05-07T20:32:47.3569920Z if compiled: 2025-05-07T20:32:47.3570022Z op = torch.compile(op) 2025-05-07T20:32:47.3570135Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3570209Z 2025-05-07T20:32:47.3570304Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3570308Z 2025-05-07T20:32:47.3570407Z moe/activation_test.py:117: 2025-05-07T20:32:47.3570540Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3570649Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3570754Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3571136Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3571239Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3571749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3571911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3572286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3572518Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3572873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3572968Z kernel = self.compile( 2025-05-07T20:32:47.3573362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3573589Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3573720Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3573725Z 2025-05-07T20:32:47.3573937Z self = 2025-05-07T20:32:47.3574819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3575348Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73cd60>} 2025-05-07T20:32:47.3576117Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3576316Z context = 2025-05-07T20:32:47.3576320Z 2025-05-07T20:32:47.3576494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3576773Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3576891Z module_map=module_map) 2025-05-07T20:32:47.3577059Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3577161Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3577243Z E ^ 2025-05-07T20:32:47.3577610Z E ValueError("type fp8e4nv not supported in this architecture. 
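For orientation, ref_fn in the test is SiLU-and-multiply in fp32 followed by row-wise FP8 quantization via triton_quantize_fp8_row. A rough pure-PyTorch sketch of that computation, assuming row-wise dynamic scaling against the E4M3 finite max of 448 and clamping of the row max to scale_ub when one is given; those details are assumptions for illustration, not fbgemm_gpu's exact kernel logic:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # assumed finite max of torch.float8_e4m3fn


def rowwise_quantize_fp8_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # One dequantization scale per row, chosen so the row's max |value|
    # lands at the edge of the fp8 range.
    row_max = y.abs().amax(dim=1, keepdim=True).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
    row_max = row_max.clamp(min=1e-12)  # guard all-zero rows
    scale = row_max / E4M3_MAX
    y_fp8 = (y.to(torch.float32) / scale).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(1)


def silu_mul_quant_ref_sketch(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Mirrors ref_fn in the test: y = SiLU(x0) * x1 in fp32, then quantize per row.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
    return rowwise_quantize_fp8_sketch(y, scale_ub)

Dequantization then matches the test's check: y_fp8.to(torch.float32) * y_scale[:, None] recovers y up to fp8 rounding.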
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3577615Z 2025-05-07T20:32:47.3578067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3578084Z 2025-05-07T20:32:47.3578235Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3578546Z self=, 2025-05-07T20:32:47.3578659Z T=1, 2025-05-07T20:32:47.3578739Z D=5120, 2025-05-07T20:32:47.3578823Z scale_ub=None, 2025-05-07T20:32:47.3578921Z contiguous=False, 2025-05-07T20:32:47.3579011Z compiled=True, 2025-05-07T20:32:47.3579087Z ) 2025-05-07T20:32:47.3579323Z self = 2025-05-07T20:32:47.3579492Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3579497Z 2025-05-07T20:32:47.3579583Z @given( 2025-05-07T20:32:47.3579707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3579809Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3579931Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3580052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3580172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3580252Z ) 2025-05-07T20:32:47.3580507Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3580606Z def test_silu_mul_quant( 2025-05-07T20:32:47.3580691Z self, 2025-05-07T20:32:47.3580770Z T: int, 2025-05-07T20:32:47.3580907Z D: int, 2025-05-07T20:32:47.3581018Z scale_ub: Optional[float], 2025-05-07T20:32:47.3581113Z contiguous: bool, 2025-05-07T20:32:47.3581203Z compiled: bool, 2025-05-07T20:32:47.3581284Z ) -> None: 2025-05-07T20:32:47.3581382Z torch.manual_seed(2025) 2025-05-07T20:32:47.3581459Z 2025-05-07T20:32:47.3581635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3581712Z 2025-05-07T20:32:47.3581808Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3581936Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3582069Z x = x_sign * x_clamp 2025-05-07T20:32:47.3582157Z x0 = x[:, :D] 2025-05-07T20:32:47.3582240Z x1 = x[:, D:] 2025-05-07T20:32:47.3582314Z 2025-05-07T20:32:47.3582402Z if contiguous: 2025-05-07T20:32:47.3582495Z x0 = x0.contiguous() 2025-05-07T20:32:47.3582625Z x1 = x1.contiguous() 2025-05-07T20:32:47.3582702Z 2025-05-07T20:32:47.3582837Z if scale_ub is not None: 2025-05-07T20:32:47.3582953Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3583093Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3583172Z ) 2025-05-07T20:32:47.3583253Z else: 2025-05-07T20:32:47.3583350Z scale_ub_tensor = None 2025-05-07T20:32:47.3583426Z 2025-05-07T20:32:47.3583562Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3583656Z op = silu_mul_quant 2025-05-07T20:32:47.3583743Z if compiled: 2025-05-07T20:32:47.3583852Z op = torch.compile(op) 2025-05-07T20:32:47.3583960Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3584034Z 2025-05-07T20:32:47.3584132Z y_fp8, y_scale = fn() 2025-05-07T20:32:47.3584257Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:47.3584338Z 2025-05-07T20:32:47.3584479Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3584586Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:47.3584693Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:47.3584818Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:47.3584962Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3585040Z 2025-05-07T20:32:47.3585143Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:47.3585148Z 2025-05-07T20:32:47.3585249Z moe/activation_test.py:126: 2025-05-07T20:32:47.3585385Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3585495Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:47.3585640Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:47.3586216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:47.3586327Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:47.3586707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3586942Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3587322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:47.3587588Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:47.3587977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:47.3588156Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:47.3588510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:47.3588591Z fn() 2025-05-07T20:32:47.3589059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:47.3589145Z self.fn.run( 2025-05-07T20:32:47.3589499Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3589596Z kernel = self.compile( 2025-05-07T20:32:47.3589988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3590176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3590351Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3590355Z 2025-05-07T20:32:47.3590571Z self = 2025-05-07T20:32:47.3591410Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3591993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad732de0>} 2025-05-07T20:32:47.3592763Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3592966Z context = 2025-05-07T20:32:47.3592972Z 2025-05-07T20:32:47.3593148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3593423Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3593534Z module_map=module_map) 2025-05-07T20:32:47.3593710Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3593821Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:47.3593901Z E ^ 2025-05-07T20:32:47.3594272Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3594276Z 2025-05-07T20:32:47.3594706Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3594710Z 2025-05-07T20:32:47.3594822Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3595053Z self=, 2025-05-07T20:32:47.3595135Z T=1, 2025-05-07T20:32:47.3595220Z D=5120, 2025-05-07T20:32:47.3595304Z scale_ub=None, 2025-05-07T20:32:47.3595396Z contiguous=True, 2025-05-07T20:32:47.3595482Z compiled=False, 2025-05-07T20:32:47.3595557Z ) 2025-05-07T20:32:47.3595789Z self = 2025-05-07T20:32:47.3595964Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.3595969Z 2025-05-07T20:32:47.3596046Z @given( 2025-05-07T20:32:47.3596175Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3596280Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3596398Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3596525Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3596642Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3596721Z ) 2025-05-07T20:32:47.3596976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3597071Z def test_silu_mul_quant( 2025-05-07T20:32:47.3597152Z self, 2025-05-07T20:32:47.3597232Z T: int, 2025-05-07T20:32:47.3597309Z D: int, 2025-05-07T20:32:47.3597416Z scale_ub: Optional[float], 2025-05-07T20:32:47.3597512Z contiguous: bool, 2025-05-07T20:32:47.3597656Z compiled: bool, 2025-05-07T20:32:47.3597743Z ) -> None: 2025-05-07T20:32:47.3597841Z torch.manual_seed(2025) 2025-05-07T20:32:47.3597915Z 2025-05-07T20:32:47.3598097Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3598174Z 2025-05-07T20:32:47.3598268Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3598399Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3598489Z x = x_sign * x_clamp 2025-05-07T20:32:47.3598574Z x0 = x[:, :D] 2025-05-07T20:32:47.3598699Z x1 = x[:, D:] 2025-05-07T20:32:47.3598775Z 2025-05-07T20:32:47.3598864Z if contiguous: 2025-05-07T20:32:47.3598958Z x0 = x0.contiguous() 2025-05-07T20:32:47.3599051Z x1 = x1.contiguous() 2025-05-07T20:32:47.3599128Z 2025-05-07T20:32:47.3599222Z if scale_ub is not None: 2025-05-07T20:32:47.3599372Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3599556Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3599634Z ) 2025-05-07T20:32:47.3599712Z else: 2025-05-07T20:32:47.3599813Z scale_ub_tensor = None 2025-05-07T20:32:47.3599887Z 2025-05-07T20:32:47.3600025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3600228Z op = silu_mul_quant 2025-05-07T20:32:47.3600319Z if compiled: 2025-05-07T20:32:47.3600424Z op = torch.compile(op) 2025-05-07T20:32:47.3600534Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3600611Z 2025-05-07T20:32:47.3600709Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3600713Z 2025-05-07T20:32:47.3600815Z moe/activation_test.py:117: 2025-05-07T20:32:47.3600947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3601053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3601160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3601684Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3601785Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3602156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3602392Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3602745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3602843Z kernel = self.compile( 2025-05-07T20:32:47.3603243Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3603426Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3603562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3603568Z 2025-05-07T20:32:47.3603783Z self = 2025-05-07T20:32:47.3604581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3605103Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898019b20>} 2025-05-07T20:32:47.3605869Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3606072Z context = 2025-05-07T20:32:47.3606080Z 2025-05-07T20:32:47.3606300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3606583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3606696Z module_map=module_map) 2025-05-07T20:32:47.3606863Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3606969Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3607048Z E ^ 2025-05-07T20:32:47.3607413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3607457Z 2025-05-07T20:32:47.3607890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3607895Z 2025-05-07T20:32:47.3608002Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3608235Z self=, 2025-05-07T20:32:47.3608356Z T=128, 2025-05-07T20:32:47.3608437Z D=5120, 2025-05-07T20:32:47.3608565Z scale_ub=None, 2025-05-07T20:32:47.3608656Z contiguous=False, 2025-05-07T20:32:47.3608742Z compiled=True, 2025-05-07T20:32:47.3608822Z ) 2025-05-07T20:32:47.3609049Z self = 2025-05-07T20:32:47.3609225Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3609235Z 2025-05-07T20:32:47.3609315Z @given( 2025-05-07T20:32:47.3609439Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3609550Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3609669Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3609791Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3609912Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3609988Z ) 2025-05-07T20:32:47.3610249Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3610355Z def test_silu_mul_quant( 2025-05-07T20:32:47.3610433Z self, 2025-05-07T20:32:47.3610511Z T: int, 2025-05-07T20:32:47.3610592Z D: int, 2025-05-07T20:32:47.3610694Z scale_ub: Optional[float], 2025-05-07T20:32:47.3610789Z contiguous: bool, 2025-05-07T20:32:47.3610877Z compiled: bool, 2025-05-07T20:32:47.3610958Z ) -> None: 2025-05-07T20:32:47.3611058Z torch.manual_seed(2025) 2025-05-07T20:32:47.3611136Z 2025-05-07T20:32:47.3611312Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3611392Z 2025-05-07T20:32:47.3611486Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3611615Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3611708Z x = x_sign * x_clamp 2025-05-07T20:32:47.3611790Z x0 = x[:, :D] 2025-05-07T20:32:47.3611875Z x1 = x[:, D:] 2025-05-07T20:32:47.3611952Z 2025-05-07T20:32:47.3612042Z if contiguous: 2025-05-07T20:32:47.3612141Z x0 = x0.contiguous() 2025-05-07T20:32:47.3612233Z x1 = x1.contiguous() 2025-05-07T20:32:47.3612307Z 2025-05-07T20:32:47.3612404Z if scale_ub is not None: 2025-05-07T20:32:47.3612513Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3612652Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3612735Z ) 2025-05-07T20:32:47.3612813Z else: 2025-05-07T20:32:47.3612909Z scale_ub_tensor = None 2025-05-07T20:32:47.3612986Z 2025-05-07T20:32:47.3613122Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3613215Z op = silu_mul_quant 2025-05-07T20:32:47.3613305Z if compiled: 2025-05-07T20:32:47.3613767Z op = torch.compile(op) 2025-05-07T20:32:47.3613885Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3613962Z 2025-05-07T20:32:47.3614056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3614152Z 2025-05-07T20:32:47.3614262Z moe/activation_test.py:117: 2025-05-07T20:32:47.3614395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3614499Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3614606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3614989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3615086Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3615600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3615759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3616133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3616422Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3616832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3616934Z kernel = self.compile( 2025-05-07T20:32:47.3617328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3617513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3617648Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3617653Z 2025-05-07T20:32:47.3617871Z self = 2025-05-07T20:32:47.3618677Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3619208Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad733a60>} 2025-05-07T20:32:47.3619980Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3620179Z context = 2025-05-07T20:32:47.3620183Z 2025-05-07T20:32:47.3620357Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3620637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3620748Z module_map=module_map) 2025-05-07T20:32:47.3620920Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3621023Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3621104Z E ^ 2025-05-07T20:32:47.3621481Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3621485Z 2025-05-07T20:32:47.3621913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3621918Z 2025-05-07T20:32:47.3622028Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3622262Z self=, 2025-05-07T20:32:47.3622341Z T=128, 2025-05-07T20:32:47.3622425Z D=7168, 2025-05-07T20:32:47.3622513Z scale_ub=1200.0, 2025-05-07T20:32:47.3622603Z contiguous=False, 2025-05-07T20:32:47.3622692Z compiled=False, 2025-05-07T20:32:47.3622767Z ) 2025-05-07T20:32:47.3622991Z self = 2025-05-07T20:32:47.3623177Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3623184Z 2025-05-07T20:32:47.3623309Z @given( 2025-05-07T20:32:47.3623444Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3623547Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3623666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3623790Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3623906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3623983Z ) 2025-05-07T20:32:47.3624240Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3624379Z def test_silu_mul_quant( 2025-05-07T20:32:47.3624457Z self, 2025-05-07T20:32:47.3624538Z T: int, 2025-05-07T20:32:47.3624615Z D: int, 2025-05-07T20:32:47.3624716Z scale_ub: Optional[float], 2025-05-07T20:32:47.3624813Z contiguous: bool, 2025-05-07T20:32:47.3624903Z compiled: bool, 2025-05-07T20:32:47.3625053Z ) -> None: 2025-05-07T20:32:47.3625155Z torch.manual_seed(2025) 2025-05-07T20:32:47.3625231Z 2025-05-07T20:32:47.3625447Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3625527Z 2025-05-07T20:32:47.3625622Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3625753Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3625843Z x = x_sign * x_clamp 2025-05-07T20:32:47.3625924Z x0 = x[:, :D] 2025-05-07T20:32:47.3626010Z x1 = x[:, D:] 2025-05-07T20:32:47.3626087Z 2025-05-07T20:32:47.3626172Z if contiguous: 2025-05-07T20:32:47.3626273Z x0 = x0.contiguous() 2025-05-07T20:32:47.3626366Z x1 = x1.contiguous() 2025-05-07T20:32:47.3626442Z 2025-05-07T20:32:47.3626535Z if scale_ub is not None: 2025-05-07T20:32:47.3626643Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3626785Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3626865Z ) 2025-05-07T20:32:47.3626946Z else: 2025-05-07T20:32:47.3627049Z scale_ub_tensor = None 2025-05-07T20:32:47.3627124Z 2025-05-07T20:32:47.3627257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3627353Z op = silu_mul_quant 2025-05-07T20:32:47.3627441Z if compiled: 2025-05-07T20:32:47.3627544Z op = torch.compile(op) 2025-05-07T20:32:47.3627655Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3627731Z 2025-05-07T20:32:47.3627823Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3627837Z 2025-05-07T20:32:47.3627940Z moe/activation_test.py:117: 2025-05-07T20:32:47.3628072Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3631778Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3631908Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3632448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3632558Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3632936Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3633178Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3633536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3633641Z kernel = self.compile( 2025-05-07T20:32:47.3634042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3634231Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3634376Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3634381Z 2025-05-07T20:32:47.3634599Z self = 2025-05-07T20:32:47.3635477Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3636009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07adb3a2a0>} 2025-05-07T20:32:47.3636784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3637028Z context = 2025-05-07T20:32:47.3637032Z 2025-05-07T20:32:47.3637208Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3637535Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3637684Z module_map=module_map) 2025-05-07T20:32:47.3637857Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3637971Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3638053Z E ^ 2025-05-07T20:32:47.3638427Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3638432Z 2025-05-07T20:32:47.3638870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3638877Z 2025-05-07T20:32:47.3638987Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3639227Z self=, 2025-05-07T20:32:47.3639308Z T=128, 2025-05-07T20:32:47.3639393Z D=5120, 2025-05-07T20:32:47.3639483Z scale_ub=None, 2025-05-07T20:32:47.3639579Z contiguous=False, 2025-05-07T20:32:47.3639676Z compiled=False, 2025-05-07T20:32:47.3639757Z ) 2025-05-07T20:32:47.3639989Z self = 2025-05-07T20:32:47.3640268Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.3640274Z 2025-05-07T20:32:47.3640355Z @given( 2025-05-07T20:32:47.3640483Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3640594Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3640717Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3640847Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3640967Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3641045Z ) 2025-05-07T20:32:47.3641307Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3641410Z def test_silu_mul_quant( 2025-05-07T20:32:47.3641490Z self, 2025-05-07T20:32:47.3641575Z T: int, 2025-05-07T20:32:47.3641657Z D: int, 2025-05-07T20:32:47.3641762Z scale_ub: Optional[float], 2025-05-07T20:32:47.3641858Z contiguous: bool, 2025-05-07T20:32:47.3641950Z compiled: bool, 2025-05-07T20:32:47.3642034Z ) -> None: 2025-05-07T20:32:47.3642136Z torch.manual_seed(2025) 2025-05-07T20:32:47.3642212Z 2025-05-07T20:32:47.3642393Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3642475Z 2025-05-07T20:32:47.3642571Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3642707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3642800Z x = x_sign * x_clamp 2025-05-07T20:32:47.3642885Z x0 = x[:, :D] 2025-05-07T20:32:47.3642971Z x1 = x[:, D:] 2025-05-07T20:32:47.3643048Z 2025-05-07T20:32:47.3643136Z if contiguous: 2025-05-07T20:32:47.3643237Z x0 = x0.contiguous() 2025-05-07T20:32:47.3643379Z x1 = x1.contiguous() 2025-05-07T20:32:47.3643456Z 2025-05-07T20:32:47.3643558Z if scale_ub is not None: 2025-05-07T20:32:47.3643671Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3643815Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3643901Z ) 2025-05-07T20:32:47.3643981Z else: 2025-05-07T20:32:47.3644083Z scale_ub_tensor = None 2025-05-07T20:32:47.3644159Z 2025-05-07T20:32:47.3644296Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3644396Z op = silu_mul_quant 2025-05-07T20:32:47.3644531Z if compiled: 2025-05-07T20:32:47.3644636Z op = torch.compile(op) 2025-05-07T20:32:47.3644751Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3644827Z 2025-05-07T20:32:47.3644923Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3644927Z 2025-05-07T20:32:47.3645072Z moe/activation_test.py:117: 2025-05-07T20:32:47.3645250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3645362Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3645468Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3645991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3646100Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3646474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3646712Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3647070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3647170Z kernel = self.compile( 2025-05-07T20:32:47.3647575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3647772Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3647906Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3647910Z 2025-05-07T20:32:47.3648132Z self = 2025-05-07T20:32:47.3648937Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3649474Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad73c720>} 2025-05-07T20:32:47.3650249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3650456Z context = 2025-05-07T20:32:47.3650463Z 2025-05-07T20:32:47.3650638Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3650914Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3651031Z module_map=module_map) 2025-05-07T20:32:47.3651203Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3651307Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3651395Z E ^ 2025-05-07T20:32:47.3651764Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3651769Z 2025-05-07T20:32:47.3652206Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3652212Z 2025-05-07T20:32:47.3652368Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3652606Z self=, 2025-05-07T20:32:47.3652691Z T=128, 2025-05-07T20:32:47.3652771Z D=5120, 2025-05-07T20:32:47.3652860Z scale_ub=1200.0, 2025-05-07T20:32:47.3652953Z contiguous=True, 2025-05-07T20:32:47.3653041Z compiled=False, 2025-05-07T20:32:47.3653118Z ) 2025-05-07T20:32:47.3653353Z self = 2025-05-07T20:32:47.3653534Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.3653581Z 2025-05-07T20:32:47.3653667Z @given( 2025-05-07T20:32:47.3653794Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3653900Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3654025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3654188Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3654345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3654427Z ) 2025-05-07T20:32:47.3654687Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3654785Z def test_silu_mul_quant( 2025-05-07T20:32:47.3654868Z self, 2025-05-07T20:32:47.3654949Z T: int, 2025-05-07T20:32:47.3655032Z D: int, 2025-05-07T20:32:47.3655136Z scale_ub: Optional[float], 2025-05-07T20:32:47.3655230Z contiguous: bool, 2025-05-07T20:32:47.3655323Z compiled: bool, 2025-05-07T20:32:47.3655406Z ) -> None: 2025-05-07T20:32:47.3655505Z torch.manual_seed(2025) 2025-05-07T20:32:47.3655584Z 2025-05-07T20:32:47.3655762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3655839Z 2025-05-07T20:32:47.3655939Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3656071Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3656164Z x = x_sign * x_clamp 2025-05-07T20:32:47.3656253Z x0 = x[:, :D] 2025-05-07T20:32:47.3656336Z x1 = x[:, D:] 2025-05-07T20:32:47.3656415Z 2025-05-07T20:32:47.3656502Z if contiguous: 2025-05-07T20:32:47.3656597Z x0 = x0.contiguous() 2025-05-07T20:32:47.3656692Z x1 = x1.contiguous() 2025-05-07T20:32:47.3656768Z 2025-05-07T20:32:47.3656863Z if scale_ub is not None: 2025-05-07T20:32:47.3656979Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3657120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3657201Z ) 2025-05-07T20:32:47.3657284Z else: 2025-05-07T20:32:47.3657382Z scale_ub_tensor = None 2025-05-07T20:32:47.3657457Z 2025-05-07T20:32:47.3657595Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3657688Z op = silu_mul_quant 2025-05-07T20:32:47.3657777Z if compiled: 2025-05-07T20:32:47.3657886Z op = torch.compile(op) 2025-05-07T20:32:47.3657999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3658078Z 2025-05-07T20:32:47.3658173Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3658177Z 2025-05-07T20:32:47.3658277Z moe/activation_test.py:117: 2025-05-07T20:32:47.3658415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3658520Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3658624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3659150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3659256Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3659632Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3659917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3660276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3660379Z kernel = self.compile( 2025-05-07T20:32:47.3660778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3660960Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3661101Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3661106Z 2025-05-07T20:32:47.3661388Z self = 2025-05-07T20:32:47.3662196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3662796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f8c20>} 2025-05-07T20:32:47.3663575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3663776Z context = 2025-05-07T20:32:47.3663780Z 2025-05-07T20:32:47.3663954Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3664235Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3664347Z module_map=module_map) 2025-05-07T20:32:47.3664519Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3664627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3664706Z E ^ 2025-05-07T20:32:47.3665084Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3665089Z 2025-05-07T20:32:47.3665518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3665523Z 2025-05-07T20:32:47.3665630Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3665867Z self=, 2025-05-07T20:32:47.3665947Z T=1, 2025-05-07T20:32:47.3666032Z D=7168, 2025-05-07T20:32:47.3666120Z scale_ub=1200.0, 2025-05-07T20:32:47.3666210Z contiguous=True, 2025-05-07T20:32:47.3666301Z compiled=True, 2025-05-07T20:32:47.3666378Z ) 2025-05-07T20:32:47.3666606Z self = 2025-05-07T20:32:47.3666782Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3666790Z 2025-05-07T20:32:47.3666875Z @given( 2025-05-07T20:32:47.3667003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3667110Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3667229Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3667355Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3667473Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3667549Z ) 2025-05-07T20:32:47.3667809Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3667909Z def test_silu_mul_quant( 2025-05-07T20:32:47.3667988Z self, 2025-05-07T20:32:47.3668069Z T: int, 2025-05-07T20:32:47.3668148Z D: int, 2025-05-07T20:32:47.3668251Z scale_ub: Optional[float], 2025-05-07T20:32:47.3668350Z contiguous: bool, 2025-05-07T20:32:47.3668438Z compiled: bool, 2025-05-07T20:32:47.3668521Z ) -> None: 2025-05-07T20:32:47.3668720Z torch.manual_seed(2025) 2025-05-07T20:32:47.3668799Z 2025-05-07T20:32:47.3668989Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3669066Z 2025-05-07T20:32:47.3669161Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3669296Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3669389Z x = x_sign * x_clamp 2025-05-07T20:32:47.3669473Z x0 = x[:, :D] 2025-05-07T20:32:47.3669558Z x1 = x[:, D:] 2025-05-07T20:32:47.3669631Z 2025-05-07T20:32:47.3669716Z if contiguous: 2025-05-07T20:32:47.3669854Z x0 = x0.contiguous() 2025-05-07T20:32:47.3669946Z x1 = x1.contiguous() 2025-05-07T20:32:47.3670020Z 2025-05-07T20:32:47.3670119Z if scale_ub is not None: 2025-05-07T20:32:47.3670227Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3670371Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3670489Z ) 2025-05-07T20:32:47.3670569Z else: 2025-05-07T20:32:47.3670707Z scale_ub_tensor = None 2025-05-07T20:32:47.3670782Z 2025-05-07T20:32:47.3670916Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3671011Z op = silu_mul_quant 2025-05-07T20:32:47.3671097Z if compiled: 2025-05-07T20:32:47.3671199Z op = torch.compile(op) 2025-05-07T20:32:47.3671310Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3671384Z 2025-05-07T20:32:47.3671476Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3671481Z 2025-05-07T20:32:47.3671585Z moe/activation_test.py:117: 2025-05-07T20:32:47.3671718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3671825Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3671928Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3672308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3672417Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3672928Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3673032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3673405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3673637Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3673994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3674093Z kernel = self.compile( 2025-05-07T20:32:47.3674489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3674675Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3674816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3674820Z 2025-05-07T20:32:47.3675037Z self = 2025-05-07T20:32:47.3675834Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3676356Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad3f9ee0>} 2025-05-07T20:32:47.3677125Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3677327Z context = 2025-05-07T20:32:47.3677376Z 2025-05-07T20:32:47.3677553Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3677826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3677935Z module_map=module_map) 2025-05-07T20:32:47.3678110Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3678212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3678293Z E ^ 2025-05-07T20:32:47.3678659Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3679135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:47.3679249Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the previous example, elided: _fbgemm_silu_mul_quant fails to compile with the same CompilationError at compiler.py:100.]
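[Every failure above has the same root cause: the kernel asks Triton for the fp8e4nv dtype, NVIDIA's FP8 E4M3 format, which Triton only lowers natively on GPUs of compute capability 8.9 or newer (Ada/Hopper class); on older parts only the 'fp8e4b15' and 'fp8e5' formats exist, exactly as the ValueError reports. A minimal sketch of a guard that would skip such tests instead of failing at compile time; supports_fp8e4nv and requires_fp8 are illustrative names, not part of the FBGEMM test suite:]

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Best-effort check for native FP8 E4M3 (fp8e4nv) support: sm_89+."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Decorating fp8-dependent tests with this skips them cleanly on pre-sm_89
# GPUs rather than letting Triton raise at kernel-compile time.
requires_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "FP8 E4M3 (fp8e4nv) requires compute capability >= 8.9"
)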
2025-05-07T20:32:47.3692828Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
[identical test body elided down to fn(); this example got past fn() and failed in the reference path instead:]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
[jit.py/compiler.py frames identical to the previous traceback, elided]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
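[For context, the reference path computes silu(x0) * x1 followed by row-wise FP8 quantization with an optional per-row scale upper bound. A hedged, pure-PyTorch sketch of that contract; the names mirror the test, and the scale_ub clamping detail is a plausible reading of triton_quantize_fp8_row's behavior, not its actual source:]

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # silu(x0) * x1 in fp32, as in the test's ref_fn
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # per-row absolute max, optionally capped by scale_ub
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX                    # per-row dequantization scale
    y_fp8 = (y / scale).to(torch.float8_e4m3fn)  # quantize
    return y_fp8, scale.squeeze(-1)

[Dequantizing with y_fp8.to(torch.float32) * scale[:, None] reproduces the comparison the test performs on y and y_ref.]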
2025-05-07T20:32:47.3709188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[the next four examples fail identically to the first — same test body, same _fbgemm_silu_mul_quant CompilationError — so only their headers are kept:]
2025-05-07T20:32:47.3709302Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3723406Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3736417Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3749961Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
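[Note that the compiled=True and compiled=False examples fail identically: both paths launch the same Triton kernel, so torch.compile is not a factor. The ValueError itself names the fp8 formats that do compile on this architecture; a hedged sketch of a capability-based fallback follows — illustrative only, FBGEMM does not necessarily do this, and E5M2's reduced mantissa may not be acceptable for every workload:]

import torch
import triton.language as tl


def pick_fp8_tl_dtype():
    """Choose an fp8 Triton dtype the current GPU can actually compile."""
    major, minor = torch.cuda.get_device_capability()
    # tl.float8e4nv (E4M3) needs sm_89+; tl.float8e5 (E5M2) is the listed
    # alternative on older architectures.
    return tl.float8e4nv if (major, minor) >= (8, 9) else tl.float8e5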
[five more examples fail the same way; headers only:]
2025-05-07T20:32:47.3767268Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:47.3780365Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3793503Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:47.3807085Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3821021Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.3827923Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3828295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3828528Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3828909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3829018Z kernel = self.compile( 2025-05-07T20:32:47.3829442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3829625Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3829759Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3829763Z 2025-05-07T20:32:47.3829981Z self = 2025-05-07T20:32:47.3830790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3831372Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad2916c0>} 2025-05-07T20:32:47.3832143Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3832346Z context = 2025-05-07T20:32:47.3832351Z 2025-05-07T20:32:47.3832523Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3832798Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3832954Z module_map=module_map) 2025-05-07T20:32:47.3833123Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3833225Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3833307Z E ^ 2025-05-07T20:32:47.3833723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3833764Z 2025-05-07T20:32:47.3834205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3834210Z 2025-05-07T20:32:47.3834320Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3834552Z self=, 2025-05-07T20:32:47.3834635Z T=4096, 2025-05-07T20:32:47.3834715Z D=5120, 2025-05-07T20:32:47.3834801Z scale_ub=1200.0, 2025-05-07T20:32:47.3834897Z contiguous=False, 2025-05-07T20:32:47.3834984Z compiled=True, 2025-05-07T20:32:47.3835062Z ) 2025-05-07T20:32:47.3835292Z self = 2025-05-07T20:32:47.3835475Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:47.3835482Z 2025-05-07T20:32:47.3835567Z @given( 2025-05-07T20:32:47.3835693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3835797Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3835920Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3836041Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3836162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3836239Z ) 2025-05-07T20:32:47.3836496Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3836596Z def test_silu_mul_quant( 2025-05-07T20:32:47.3836675Z self, 2025-05-07T20:32:47.3836758Z T: int, 2025-05-07T20:32:47.3836840Z D: int, 2025-05-07T20:32:47.3836941Z scale_ub: Optional[float], 2025-05-07T20:32:47.3837035Z contiguous: bool, 2025-05-07T20:32:47.3837126Z compiled: bool, 2025-05-07T20:32:47.3837205Z ) -> None: 2025-05-07T20:32:47.3837307Z torch.manual_seed(2025) 2025-05-07T20:32:47.3837385Z 2025-05-07T20:32:47.3837565Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3837641Z 2025-05-07T20:32:47.3837737Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3837868Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3837964Z x = x_sign * x_clamp 2025-05-07T20:32:47.3838045Z x0 = x[:, :D] 2025-05-07T20:32:47.3838129Z x1 = x[:, D:] 2025-05-07T20:32:47.3838206Z 2025-05-07T20:32:47.3838294Z if contiguous: 2025-05-07T20:32:47.3838388Z x0 = x0.contiguous() 2025-05-07T20:32:47.3838484Z x1 = x1.contiguous() 2025-05-07T20:32:47.3838559Z 2025-05-07T20:32:47.3838655Z if scale_ub is not None: 2025-05-07T20:32:47.3838766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3838904Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3838981Z ) 2025-05-07T20:32:47.3839063Z else: 2025-05-07T20:32:47.3839205Z scale_ub_tensor = None 2025-05-07T20:32:47.3839285Z 2025-05-07T20:32:47.3839421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3839513Z op = silu_mul_quant 2025-05-07T20:32:47.3839602Z if compiled: 2025-05-07T20:32:47.3839704Z op = torch.compile(op) 2025-05-07T20:32:47.3839811Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3839888Z 2025-05-07T20:32:47.3839979Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3839984Z 2025-05-07T20:32:47.3840153Z moe/activation_test.py:117: 2025-05-07T20:32:47.3840334Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3840437Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3840543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3840922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3841072Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3841623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3841726Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3842096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3842330Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3842682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3842783Z kernel = self.compile( 2025-05-07T20:32:47.3843180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3843361Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3843496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3843503Z 2025-05-07T20:32:47.3843716Z self = 2025-05-07T20:32:47.3844518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3845040Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad292fc0>} 2025-05-07T20:32:47.3845809Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3846009Z context = 2025-05-07T20:32:47.3846017Z 2025-05-07T20:32:47.3846190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3846470Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3846581Z module_map=module_map) 2025-05-07T20:32:47.3846746Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3846851Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3846930Z E ^ 2025-05-07T20:32:47.3847295Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3847306Z 2025-05-07T20:32:47.3847734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3847738Z 2025-05-07T20:32:47.3847844Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3848077Z self=, 2025-05-07T20:32:47.3848160Z T=2048, 2025-05-07T20:32:47.3848282Z D=7168, 2025-05-07T20:32:47.3848375Z scale_ub=1200.0, 2025-05-07T20:32:47.3848466Z contiguous=False, 2025-05-07T20:32:47.3848554Z compiled=False, 2025-05-07T20:32:47.3848632Z ) 2025-05-07T20:32:47.3848859Z self = 2025-05-07T20:32:47.3849045Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3849050Z 2025-05-07T20:32:47.3849128Z @given( 2025-05-07T20:32:47.3849252Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3849400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3849521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3849642Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3849762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3849878Z ) 2025-05-07T20:32:47.3850137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3850295Z def test_silu_mul_quant( 2025-05-07T20:32:47.3850373Z self, 2025-05-07T20:32:47.3850454Z T: int, 2025-05-07T20:32:47.3850531Z D: int, 2025-05-07T20:32:47.3850631Z scale_ub: Optional[float], 2025-05-07T20:32:47.3850726Z contiguous: bool, 2025-05-07T20:32:47.3850815Z compiled: bool, 2025-05-07T20:32:47.3850893Z ) -> None: 2025-05-07T20:32:47.3850994Z torch.manual_seed(2025) 2025-05-07T20:32:47.3851069Z 2025-05-07T20:32:47.3851244Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3851328Z 2025-05-07T20:32:47.3851422Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3851550Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3851644Z x = x_sign * x_clamp 2025-05-07T20:32:47.3851725Z x0 = x[:, :D] 2025-05-07T20:32:47.3851811Z x1 = x[:, D:] 2025-05-07T20:32:47.3851885Z 2025-05-07T20:32:47.3851972Z if contiguous: 2025-05-07T20:32:47.3852071Z x0 = x0.contiguous() 2025-05-07T20:32:47.3852163Z x1 = x1.contiguous() 2025-05-07T20:32:47.3852238Z 2025-05-07T20:32:47.3852335Z if scale_ub is not None: 2025-05-07T20:32:47.3852444Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3852584Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3852664Z ) 2025-05-07T20:32:47.3852740Z else: 2025-05-07T20:32:47.3852836Z scale_ub_tensor = None 2025-05-07T20:32:47.3852915Z 2025-05-07T20:32:47.3853048Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3853143Z op = silu_mul_quant 2025-05-07T20:32:47.3853229Z if compiled: 2025-05-07T20:32:47.3853330Z op = torch.compile(op) 2025-05-07T20:32:47.3853441Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3853519Z 2025-05-07T20:32:47.3853613Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3853617Z 2025-05-07T20:32:47.3853723Z moe/activation_test.py:117: 2025-05-07T20:32:47.3853856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3853958Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3854065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3854578Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3854680Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3855056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3855290Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3855645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3855745Z kernel = self.compile( 2025-05-07T20:32:47.3856191Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3856377Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3856508Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3856512Z 2025-05-07T20:32:47.3856726Z self = 2025-05-07T20:32:47.3857524Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3858090Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ad293ec0>} 2025-05-07T20:32:47.3858949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3859184Z context = 2025-05-07T20:32:47.3859190Z 2025-05-07T20:32:47.3859365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3859637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3859749Z module_map=module_map) 2025-05-07T20:32:47.3859917Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3860020Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3860102Z E ^ 2025-05-07T20:32:47.3860468Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3860476Z 2025-05-07T20:32:47.3860910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3860921Z 2025-05-07T20:32:47.3861026Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3861257Z self=, 2025-05-07T20:32:47.3861339Z T=1, 2025-05-07T20:32:47.3861421Z D=7168, 2025-05-07T20:32:47.3861504Z scale_ub=None, 2025-05-07T20:32:47.3861594Z contiguous=True, 2025-05-07T20:32:47.3861680Z compiled=False, 2025-05-07T20:32:47.3861754Z ) 2025-05-07T20:32:47.3861987Z self = 2025-05-07T20:32:47.3862157Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.3862162Z 2025-05-07T20:32:47.3862240Z @given( 2025-05-07T20:32:47.3862366Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3862471Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3862594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3862717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3862834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3862912Z ) 2025-05-07T20:32:47.3863166Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3863262Z def test_silu_mul_quant( 2025-05-07T20:32:47.3863344Z self, 2025-05-07T20:32:47.3863422Z T: int, 2025-05-07T20:32:47.3863500Z D: int, 2025-05-07T20:32:47.3863605Z scale_ub: Optional[float], 2025-05-07T20:32:47.3863700Z contiguous: bool, 2025-05-07T20:32:47.3863790Z compiled: bool, 2025-05-07T20:32:47.3863869Z ) -> None: 2025-05-07T20:32:47.3863965Z torch.manual_seed(2025) 2025-05-07T20:32:47.3864044Z 2025-05-07T20:32:47.3864220Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3864298Z 2025-05-07T20:32:47.3864441Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3864574Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3864664Z x = x_sign * x_clamp 2025-05-07T20:32:47.3864749Z x0 = x[:, :D] 2025-05-07T20:32:47.3864830Z x1 = x[:, D:] 2025-05-07T20:32:47.3864903Z 2025-05-07T20:32:47.3864991Z if contiguous: 2025-05-07T20:32:47.3865084Z x0 = x0.contiguous() 2025-05-07T20:32:47.3865174Z x1 = x1.contiguous() 2025-05-07T20:32:47.3865251Z 2025-05-07T20:32:47.3865344Z if scale_ub is not None: 2025-05-07T20:32:47.3865496Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3865636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3865713Z ) 2025-05-07T20:32:47.3865792Z else: 2025-05-07T20:32:47.3865888Z scale_ub_tensor = None 2025-05-07T20:32:47.3866005Z 2025-05-07T20:32:47.3866141Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3866236Z op = silu_mul_quant 2025-05-07T20:32:47.3866364Z if compiled: 2025-05-07T20:32:47.3866470Z op = torch.compile(op) 2025-05-07T20:32:47.3866579Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3866652Z 2025-05-07T20:32:47.3866746Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3866751Z 2025-05-07T20:32:47.3866849Z moe/activation_test.py:117: 2025-05-07T20:32:47.3866985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3867087Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3867192Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3867710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3867810Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3868187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3868423Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3868777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3868875Z kernel = self.compile( 2025-05-07T20:32:47.3869269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3869450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3869586Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3869590Z 2025-05-07T20:32:47.3869801Z self = 2025-05-07T20:32:47.3870609Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3871136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf38cc0>} 2025-05-07T20:32:47.3871903Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3872104Z context = 2025-05-07T20:32:47.3872111Z 2025-05-07T20:32:47.3872282Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3872563Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3872673Z module_map=module_map) 2025-05-07T20:32:47.3872886Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3872995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3873074Z E ^ 2025-05-07T20:32:47.3873442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3873447Z 2025-05-07T20:32:47.3873874Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3873878Z 2025-05-07T20:32:47.3873984Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3874216Z self=, 2025-05-07T20:32:47.3874336Z T=16384, 2025-05-07T20:32:47.3874416Z D=7168, 2025-05-07T20:32:47.3874506Z scale_ub=1200.0, 2025-05-07T20:32:47.3874595Z contiguous=False, 2025-05-07T20:32:47.3874683Z compiled=True, 2025-05-07T20:32:47.3874757Z ) 2025-05-07T20:32:47.3875026Z self = 2025-05-07T20:32:47.3875257Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:47.3875262Z 2025-05-07T20:32:47.3875343Z @given( 2025-05-07T20:32:47.3875465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3875571Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3875689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3875810Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3875929Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3876007Z ) 2025-05-07T20:32:47.3876266Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3876361Z def test_silu_mul_quant( 2025-05-07T20:32:47.3876438Z self, 2025-05-07T20:32:47.3876519Z T: int, 2025-05-07T20:32:47.3876597Z D: int, 2025-05-07T20:32:47.3876700Z scale_ub: Optional[float], 2025-05-07T20:32:47.3876795Z contiguous: bool, 2025-05-07T20:32:47.3876885Z compiled: bool, 2025-05-07T20:32:47.3876966Z ) -> None: 2025-05-07T20:32:47.3877067Z torch.manual_seed(2025) 2025-05-07T20:32:47.3877140Z 2025-05-07T20:32:47.3877314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3877392Z 2025-05-07T20:32:47.3877485Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3877616Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3877705Z x = x_sign * x_clamp 2025-05-07T20:32:47.3877786Z x0 = x[:, :D] 2025-05-07T20:32:47.3877873Z x1 = x[:, D:] 2025-05-07T20:32:47.3877946Z 2025-05-07T20:32:47.3878031Z if contiguous: 2025-05-07T20:32:47.3878126Z x0 = x0.contiguous() 2025-05-07T20:32:47.3878216Z x1 = x1.contiguous() 2025-05-07T20:32:47.3878289Z 2025-05-07T20:32:47.3878386Z if scale_ub is not None: 2025-05-07T20:32:47.3878497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3878642Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3878722Z ) 2025-05-07T20:32:47.3878802Z else: 2025-05-07T20:32:47.3878900Z scale_ub_tensor = None 2025-05-07T20:32:47.3878975Z 2025-05-07T20:32:47.3879108Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3879203Z op = silu_mul_quant 2025-05-07T20:32:47.3879288Z if compiled: 2025-05-07T20:32:47.3879390Z op = torch.compile(op) 2025-05-07T20:32:47.3879501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3879578Z 2025-05-07T20:32:47.3879670Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3879674Z 2025-05-07T20:32:47.3879776Z moe/activation_test.py:117: 2025-05-07T20:32:47.3879910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3880013Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3880184Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3880613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3880712Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3881221Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3881320Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3881693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3881966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3882320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3882416Z kernel = self.compile( 2025-05-07T20:32:47.3882815Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3883097Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3883232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3883237Z 2025-05-07T20:32:47.3883450Z self = 2025-05-07T20:32:47.3884254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3884779Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf3a0c0>} 2025-05-07T20:32:47.3885548Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3889247Z context = 2025-05-07T20:32:47.3889257Z 2025-05-07T20:32:47.3889447Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3889724Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3889842Z module_map=module_map) 2025-05-07T20:32:47.3890014Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3890121Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3890205Z E ^ 2025-05-07T20:32:47.3890576Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3890580Z 2025-05-07T20:32:47.3891022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3891030Z 2025-05-07T20:32:47.3891143Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3891378Z self=, 2025-05-07T20:32:47.3891461Z T=1, 2025-05-07T20:32:47.3891539Z D=7168, 2025-05-07T20:32:47.3891624Z scale_ub=None, 2025-05-07T20:32:47.3891717Z contiguous=False, 2025-05-07T20:32:47.3891803Z compiled=False, 2025-05-07T20:32:47.3891882Z ) 2025-05-07T20:32:47.3892110Z self = 2025-05-07T20:32:47.3892285Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.3892292Z 2025-05-07T20:32:47.3892375Z @given( 2025-05-07T20:32:47.3892499Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3892603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3892724Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3892849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3893033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3893118Z ) 2025-05-07T20:32:47.3893376Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3893475Z def test_silu_mul_quant( 2025-05-07T20:32:47.3893553Z self, 2025-05-07T20:32:47.3893632Z T: int, 2025-05-07T20:32:47.3893713Z D: int, 2025-05-07T20:32:47.3893815Z scale_ub: Optional[float], 2025-05-07T20:32:47.3893908Z contiguous: bool, 2025-05-07T20:32:47.3893999Z compiled: bool, 2025-05-07T20:32:47.3894124Z ) -> None: 2025-05-07T20:32:47.3894225Z torch.manual_seed(2025) 2025-05-07T20:32:47.3894302Z 2025-05-07T20:32:47.3894483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3894558Z 2025-05-07T20:32:47.3894656Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3894826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3894923Z x = x_sign * x_clamp 2025-05-07T20:32:47.3895047Z x0 = x[:, :D] 2025-05-07T20:32:47.3895131Z x1 = x[:, D:] 2025-05-07T20:32:47.3895209Z 2025-05-07T20:32:47.3895295Z if contiguous: 2025-05-07T20:32:47.3895390Z x0 = x0.contiguous() 2025-05-07T20:32:47.3895484Z x1 = x1.contiguous() 2025-05-07T20:32:47.3895558Z 2025-05-07T20:32:47.3895652Z if scale_ub is not None: 2025-05-07T20:32:47.3895766Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3895908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3895987Z ) 2025-05-07T20:32:47.3896068Z else: 2025-05-07T20:32:47.3896165Z scale_ub_tensor = None 2025-05-07T20:32:47.3896239Z 2025-05-07T20:32:47.3896378Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3896471Z op = silu_mul_quant 2025-05-07T20:32:47.3896563Z if compiled: 2025-05-07T20:32:47.3896669Z op = torch.compile(op) 2025-05-07T20:32:47.3896782Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3896859Z 2025-05-07T20:32:47.3896953Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3896957Z 2025-05-07T20:32:47.3897058Z moe/activation_test.py:117: 2025-05-07T20:32:47.3897196Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3897302Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3897406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3897929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3898034Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3898411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3898652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3899010Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3899112Z kernel = self.compile( 2025-05-07T20:32:47.3899508Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3899694Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3899827Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3899831Z 2025-05-07T20:32:47.3900047Z self = 2025-05-07T20:32:47.3900858Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3901432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acf3ac00>} 2025-05-07T20:32:47.3902204Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3902404Z context = 2025-05-07T20:32:47.3902408Z 2025-05-07T20:32:47.3902580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3902930Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3903041Z module_map=module_map) 2025-05-07T20:32:47.3903212Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3903353Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3903432Z E ^ 2025-05-07T20:32:47.3903845Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3903850Z 2025-05-07T20:32:47.3904282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3904286Z 2025-05-07T20:32:47.3904395Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3904628Z self=, 2025-05-07T20:32:47.3904710Z T=2048, 2025-05-07T20:32:47.3904794Z D=7168, 2025-05-07T20:32:47.3904881Z scale_ub=None, 2025-05-07T20:32:47.3904971Z contiguous=False, 2025-05-07T20:32:47.3905058Z compiled=True, 2025-05-07T20:32:47.3905133Z ) 2025-05-07T20:32:47.3905359Z self = 2025-05-07T20:32:47.3905544Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3905552Z 2025-05-07T20:32:47.3905635Z @given( 2025-05-07T20:32:47.3905763Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3905867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3905986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3906109Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3906227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3906303Z ) 2025-05-07T20:32:47.3906562Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3906662Z def test_silu_mul_quant( 2025-05-07T20:32:47.3906741Z self, 2025-05-07T20:32:47.3906824Z T: int, 2025-05-07T20:32:47.3906903Z D: int, 2025-05-07T20:32:47.3907007Z scale_ub: Optional[float], 2025-05-07T20:32:47.3907101Z contiguous: bool, 2025-05-07T20:32:47.3907189Z compiled: bool, 2025-05-07T20:32:47.3907275Z ) -> None: 2025-05-07T20:32:47.3907377Z torch.manual_seed(2025) 2025-05-07T20:32:47.3907453Z 2025-05-07T20:32:47.3907633Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3907708Z 2025-05-07T20:32:47.3907803Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3907935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3908026Z x = x_sign * x_clamp 2025-05-07T20:32:47.3908108Z x0 = x[:, :D] 2025-05-07T20:32:47.3908194Z x1 = x[:, D:] 2025-05-07T20:32:47.3908271Z 2025-05-07T20:32:47.3908356Z if contiguous: 2025-05-07T20:32:47.3908456Z x0 = x0.contiguous() 2025-05-07T20:32:47.3908549Z x1 = x1.contiguous() 2025-05-07T20:32:47.3908631Z 2025-05-07T20:32:47.3908724Z if scale_ub is not None: 2025-05-07T20:32:47.3908833Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3908977Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3909057Z ) 2025-05-07T20:32:47.3909183Z else: 2025-05-07T20:32:47.3909288Z scale_ub_tensor = None 2025-05-07T20:32:47.3909363Z 2025-05-07T20:32:47.3909497Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3909593Z op = silu_mul_quant 2025-05-07T20:32:47.3909680Z if compiled: 2025-05-07T20:32:47.3909783Z op = torch.compile(op) 2025-05-07T20:32:47.3909895Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3909971Z 2025-05-07T20:32:47.3910067Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3910113Z 2025-05-07T20:32:47.3910214Z moe/activation_test.py:117: 2025-05-07T20:32:47.3910348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3910457Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3910559Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3910940Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3911121Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3911635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3911738Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3912108Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3912341Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3912696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3912796Z kernel = self.compile( 2025-05-07T20:32:47.3913192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3913597Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3913782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3913787Z 2025-05-07T20:32:47.3914009Z self = 2025-05-07T20:32:47.3914810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3915331Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8c2c0>} 2025-05-07T20:32:47.3916103Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3916306Z context = 2025-05-07T20:32:47.3916313Z 2025-05-07T20:32:47.3916493Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3916768Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3916882Z module_map=module_map) 2025-05-07T20:32:47.3917052Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3917155Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3917237Z E ^ 2025-05-07T20:32:47.3917605Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3917612Z 2025-05-07T20:32:47.3918043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3918048Z 2025-05-07T20:32:47.3918159Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3918489Z self=, 2025-05-07T20:32:47.3918575Z T=4096, 2025-05-07T20:32:47.3918654Z D=7168, 2025-05-07T20:32:47.3918738Z scale_ub=None, 2025-05-07T20:32:47.3918830Z contiguous=False, 2025-05-07T20:32:47.3918916Z compiled=True, 2025-05-07T20:32:47.3919000Z ) 2025-05-07T20:32:47.3919273Z self = 2025-05-07T20:32:47.3919454Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.3919458Z 2025-05-07T20:32:47.3919537Z @given( 2025-05-07T20:32:47.3919758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3919861Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3919984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3920172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3920291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3920433Z ) 2025-05-07T20:32:47.3920741Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3920840Z def test_silu_mul_quant( 2025-05-07T20:32:47.3920922Z self, 2025-05-07T20:32:47.3921000Z T: int, 2025-05-07T20:32:47.3921077Z D: int, 2025-05-07T20:32:47.3921181Z scale_ub: Optional[float], 2025-05-07T20:32:47.3921272Z contiguous: bool, 2025-05-07T20:32:47.3921360Z compiled: bool, 2025-05-07T20:32:47.3921444Z ) -> None: 2025-05-07T20:32:47.3921541Z torch.manual_seed(2025) 2025-05-07T20:32:47.3921621Z 2025-05-07T20:32:47.3921796Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3921871Z 2025-05-07T20:32:47.3921967Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3922099Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3922190Z x = x_sign * x_clamp 2025-05-07T20:32:47.3922278Z x0 = x[:, :D] 2025-05-07T20:32:47.3922360Z x1 = x[:, D:] 2025-05-07T20:32:47.3922435Z 2025-05-07T20:32:47.3922527Z if contiguous: 2025-05-07T20:32:47.3922620Z x0 = x0.contiguous() 2025-05-07T20:32:47.3922710Z x1 = x1.contiguous() 2025-05-07T20:32:47.3922786Z 2025-05-07T20:32:47.3922879Z if scale_ub is not None: 2025-05-07T20:32:47.3922991Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3923130Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3923206Z ) 2025-05-07T20:32:47.3923288Z else: 2025-05-07T20:32:47.3923388Z scale_ub_tensor = None 2025-05-07T20:32:47.3923463Z 2025-05-07T20:32:47.3923598Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3923690Z op = silu_mul_quant 2025-05-07T20:32:47.3923776Z if compiled: 2025-05-07T20:32:47.3923880Z op = torch.compile(op) 2025-05-07T20:32:47.3923990Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3924067Z 2025-05-07T20:32:47.3924167Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3924172Z 2025-05-07T20:32:47.3924271Z moe/activation_test.py:117: 2025-05-07T20:32:47.3924407Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3924510Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3924611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3924993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3925088Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3925599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3925702Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3926071Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3926358Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3926711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3926807Z kernel = self.compile( 2025-05-07T20:32:47.3927204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3927385Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3927517Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3927564Z 2025-05-07T20:32:47.3927777Z self = 2025-05-07T20:32:47.3928576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3929178Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8cd60>} 2025-05-07T20:32:47.3929946Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3930149Z context = 2025-05-07T20:32:47.3930153Z 2025-05-07T20:32:47.3930325Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3930598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3930711Z module_map=module_map) 2025-05-07T20:32:47.3930877Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3930986Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3931064Z E ^ 2025-05-07T20:32:47.3931432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3931437Z 2025-05-07T20:32:47.3931871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3931876Z 2025-05-07T20:32:47.3931982Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3932216Z self=, 2025-05-07T20:32:47.3932298Z T=16384, 2025-05-07T20:32:47.3932376Z D=5120, 2025-05-07T20:32:47.3932464Z scale_ub=1200.0, 2025-05-07T20:32:47.3932551Z contiguous=False, 2025-05-07T20:32:47.3932636Z compiled=False, 2025-05-07T20:32:47.3932714Z ) 2025-05-07T20:32:47.3932939Z self = 2025-05-07T20:32:47.3933132Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.3933139Z 2025-05-07T20:32:47.3933220Z @given( 2025-05-07T20:32:47.3933342Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3933445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3933565Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3933685Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3933803Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3933878Z ) 2025-05-07T20:32:47.3934132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3934238Z def test_silu_mul_quant( 2025-05-07T20:32:47.3934319Z self, 2025-05-07T20:32:47.3934396Z T: int, 2025-05-07T20:32:47.3934476Z D: int, 2025-05-07T20:32:47.3934576Z scale_ub: Optional[float], 2025-05-07T20:32:47.3934671Z contiguous: bool, 2025-05-07T20:32:47.3934762Z compiled: bool, 2025-05-07T20:32:47.3934887Z ) -> None: 2025-05-07T20:32:47.3934988Z torch.manual_seed(2025) 2025-05-07T20:32:47.3935065Z 2025-05-07T20:32:47.3935240Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3935318Z 2025-05-07T20:32:47.3935411Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3935538Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3935631Z x = x_sign * x_clamp 2025-05-07T20:32:47.3935712Z x0 = x[:, :D] 2025-05-07T20:32:47.3935793Z x1 = x[:, D:] 2025-05-07T20:32:47.3935913Z 2025-05-07T20:32:47.3935998Z if contiguous: 2025-05-07T20:32:47.3936091Z x0 = x0.contiguous() 2025-05-07T20:32:47.3936186Z x1 = x1.contiguous() 2025-05-07T20:32:47.3936259Z 2025-05-07T20:32:47.3936352Z if scale_ub is not None: 2025-05-07T20:32:47.3936461Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3936686Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3936766Z ) 2025-05-07T20:32:47.3936883Z else: 2025-05-07T20:32:47.3936981Z scale_ub_tensor = None 2025-05-07T20:32:47.3937059Z 2025-05-07T20:32:47.3937192Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3937284Z op = silu_mul_quant 2025-05-07T20:32:47.3937374Z if compiled: 2025-05-07T20:32:47.3937475Z op = torch.compile(op) 2025-05-07T20:32:47.3937583Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3937659Z 2025-05-07T20:32:47.3937753Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3937757Z 2025-05-07T20:32:47.3937862Z moe/activation_test.py:117: 2025-05-07T20:32:47.3937995Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3938099Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3938206Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3938728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:47.3938832Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3939203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3939437Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3939791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3939888Z kernel = self.compile( 2025-05-07T20:32:47.3940284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3940465Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3940596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3940604Z 2025-05-07T20:32:47.3940823Z self = 2025-05-07T20:32:47.3941618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3942140Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8dc60>} 2025-05-07T20:32:47.3942904Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3943106Z context = 2025-05-07T20:32:47.3943112Z 2025-05-07T20:32:47.3943285Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3943603Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3943719Z module_map=module_map) 2025-05-07T20:32:47.3943885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3943987Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3944068Z E ^ 2025-05-07T20:32:47.3944432Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.3944436Z 2025-05-07T20:32:47.3944908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.3944912Z 2025-05-07T20:32:47.3945018Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.3945248Z self=, 2025-05-07T20:32:47.3945368Z T=16384, 2025-05-07T20:32:47.3945446Z D=5120, 2025-05-07T20:32:47.3945534Z scale_ub=1200.0, 2025-05-07T20:32:47.3945660Z contiguous=True, 2025-05-07T20:32:47.3945746Z compiled=True, 2025-05-07T20:32:47.3945820Z ) 2025-05-07T20:32:47.3946049Z self = 2025-05-07T20:32:47.3946231Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.3946235Z 2025-05-07T20:32:47.3946317Z @given( 2025-05-07T20:32:47.3946440Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.3946542Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.3946668Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.3946789Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.3946906Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.3946983Z ) 2025-05-07T20:32:47.3947239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.3947341Z def test_silu_mul_quant( 2025-05-07T20:32:47.3947423Z self, 2025-05-07T20:32:47.3947501Z T: int, 2025-05-07T20:32:47.3947583Z D: int, 2025-05-07T20:32:47.3947684Z scale_ub: Optional[float], 2025-05-07T20:32:47.3947775Z contiguous: bool, 2025-05-07T20:32:47.3947865Z compiled: bool, 2025-05-07T20:32:47.3947944Z ) -> None: 2025-05-07T20:32:47.3948041Z torch.manual_seed(2025) 2025-05-07T20:32:47.3948117Z 2025-05-07T20:32:47.3948292Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.3948369Z 2025-05-07T20:32:47.3948466Z x_sign = torch.sign(x) 2025-05-07T20:32:47.3948599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.3948691Z x = x_sign * x_clamp 2025-05-07T20:32:47.3948776Z x0 = x[:, :D] 2025-05-07T20:32:47.3948857Z x1 = x[:, D:] 2025-05-07T20:32:47.3948934Z 2025-05-07T20:32:47.3949023Z if contiguous: 2025-05-07T20:32:47.3949126Z x0 = x0.contiguous() 2025-05-07T20:32:47.3949221Z x1 = x1.contiguous() 2025-05-07T20:32:47.3949295Z 2025-05-07T20:32:47.3949390Z if scale_ub is not None: 2025-05-07T20:32:47.3949497Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.3949636Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.3949715Z ) 2025-05-07T20:32:47.3949792Z else: 2025-05-07T20:32:47.3949887Z scale_ub_tensor = None 2025-05-07T20:32:47.3949964Z 2025-05-07T20:32:47.3950097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.3950193Z op = silu_mul_quant 2025-05-07T20:32:47.3950283Z if compiled: 2025-05-07T20:32:47.3950384Z op = torch.compile(op) 2025-05-07T20:32:47.3950494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3950568Z 2025-05-07T20:32:47.3950662Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.3950666Z 2025-05-07T20:32:47.3950837Z moe/activation_test.py:117: 2025-05-07T20:32:47.3950973Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3951076Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.3951181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.3951560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.3951659Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.3952169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.3952309Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.3952680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.3952912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.3953342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.3953443Z kernel = self.compile( 2025-05-07T20:32:47.3953836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.3954019Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.3954149Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.3954154Z 2025-05-07T20:32:47.3954363Z self = 2025-05-07T20:32:47.3955168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.3955693Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f0898b8f380>} 2025-05-07T20:32:47.3956462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.3956661Z context = 2025-05-07T20:32:47.3956666Z 2025-05-07T20:32:47.3956839Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.3957116Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.3957228Z module_map=module_map) 2025-05-07T20:32:47.3957397Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.3957499Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.3957580Z E ^ 2025-05-07T20:32:47.3957952Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:47.3958384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Each of the eleven examples below failed with the identical CompilationError and traceback shown above; only the sampled parameters differ:

2025-05-07T20:32:47.3958499Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3972178Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:47.3985637Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.3999188Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:47.4012852Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4030207Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:47.4043836Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:47.4056927Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:47.4069923Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4083666Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:47.4097124Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4110137Z 2025-05-07T20:32:47.4110572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4110576Z 2025-05-07T20:32:47.4110682Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4110916Z self=, 2025-05-07T20:32:47.4110997Z T=16384, 2025-05-07T20:32:47.4111081Z D=5120, 2025-05-07T20:32:47.4111172Z scale_ub=None, 2025-05-07T20:32:47.4111260Z contiguous=False, 2025-05-07T20:32:47.4111344Z compiled=False, 2025-05-07T20:32:47.4111422Z ) 2025-05-07T20:32:47.4111646Z self = 2025-05-07T20:32:47.4111829Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4111833Z 2025-05-07T20:32:47.4111914Z @given( 2025-05-07T20:32:47.4112037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4112144Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4112260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4112378Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4112494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4112570Z ) 2025-05-07T20:32:47.4112826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4112969Z def test_silu_mul_quant( 2025-05-07T20:32:47.4113049Z self, 2025-05-07T20:32:47.4113128Z T: int, 2025-05-07T20:32:47.4113208Z D: int, 2025-05-07T20:32:47.4113495Z scale_ub: Optional[float], 2025-05-07T20:32:47.4113633Z contiguous: bool, 2025-05-07T20:32:47.4113768Z compiled: bool, 2025-05-07T20:32:47.4113881Z ) -> None: 2025-05-07T20:32:47.4113985Z torch.manual_seed(2025) 2025-05-07T20:32:47.4114060Z 2025-05-07T20:32:47.4114234Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4114399Z 2025-05-07T20:32:47.4114492Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4114620Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4116553Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
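[editor's note] The allocator hint in the message above is actionable as-is. A minimal sketch of wiring it up, assuming the setting is applied before any CUDA allocation happens (exporting it in the CI job environment works equally well):

```python
# Sketch: apply the allocator setting the OOM message suggests. The variable
# must be in the environment before CUDA initializes its caching allocator,
# so set it before importing torch.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported only after the allocator config is in place

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # first allocation picks up the config
```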
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4116612Z 2025-05-07T20:32:47.4116736Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4116741Z 2025-05-07T20:32:47.4116847Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4117080Z self=, 2025-05-07T20:32:47.4117156Z T=4096, 2025-05-07T20:32:47.4117236Z D=7168, 2025-05-07T20:32:47.4117319Z scale_ub=1200.0, 2025-05-07T20:32:47.4117407Z contiguous=True, 2025-05-07T20:32:47.4117490Z compiled=True, 2025-05-07T20:32:47.4117564Z ) 2025-05-07T20:32:47.4117795Z self = 2025-05-07T20:32:47.4117975Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4117979Z 2025-05-07T20:32:47.4118057Z @given( 2025-05-07T20:32:47.4118183Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4118284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4118401Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4118522Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4118637Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4118716Z ) 2025-05-07T20:32:47.4118973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4119078Z def test_silu_mul_quant( 2025-05-07T20:32:47.4119173Z self, 2025-05-07T20:32:47.4119262Z T: int, 2025-05-07T20:32:47.4119356Z D: int, 2025-05-07T20:32:47.4119462Z scale_ub: Optional[float], 2025-05-07T20:32:47.4119555Z contiguous: bool, 2025-05-07T20:32:47.4119646Z compiled: bool, 2025-05-07T20:32:47.4119727Z ) -> None: 2025-05-07T20:32:47.4119822Z torch.manual_seed(2025) 2025-05-07T20:32:47.4119898Z 2025-05-07T20:32:47.4120137Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4120213Z 2025-05-07T20:32:47.4120309Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4120435Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4122346Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4122361Z 2025-05-07T20:32:47.4122482Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4122487Z 2025-05-07T20:32:47.4122591Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4122823Z self=, 2025-05-07T20:32:47.4122902Z T=16384, 2025-05-07T20:32:47.4122980Z D=7168, 2025-05-07T20:32:47.4123068Z scale_ub=None, 2025-05-07T20:32:47.4123155Z contiguous=False, 2025-05-07T20:32:47.4123240Z compiled=False, 2025-05-07T20:32:47.4123359Z ) 2025-05-07T20:32:47.4123585Z self = 2025-05-07T20:32:47.4123768Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4123773Z 2025-05-07T20:32:47.4123851Z @given( 2025-05-07T20:32:47.4124014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4124120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4124275Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4124396Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4124514Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4124588Z ) 2025-05-07T20:32:47.4124840Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4124937Z def test_silu_mul_quant( 2025-05-07T20:32:47.4125014Z self, 2025-05-07T20:32:47.4125093Z T: int, 2025-05-07T20:32:47.4125171Z D: int, 2025-05-07T20:32:47.4125271Z scale_ub: Optional[float], 2025-05-07T20:32:47.4125363Z contiguous: bool, 2025-05-07T20:32:47.4125450Z compiled: bool, 2025-05-07T20:32:47.4125529Z ) -> None: 2025-05-07T20:32:47.4125626Z torch.manual_seed(2025) 2025-05-07T20:32:47.4125703Z 2025-05-07T20:32:47.4125879Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4127745Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
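[editor's note] The "Tried to allocate" figures track the test's input tensor exactly: `torch.randn([T, 2 * D])` in bfloat16 needs T * 2D * 2 bytes, which for the T=16384, D=7168 example just above is 448 MiB, the amount requested. A quick sanity check:

```python
def randn_input_mib(T: int, D: int, itemsize: int = 2) -> float:
    # Size of the [T, 2*D] bfloat16 tensor the test allocates first.
    return T * 2 * D * itemsize / (1024 ** 2)

assert randn_input_mib(16384, 7168) == 448.0  # "Tried to allocate 448.00 MiB"
assert randn_input_mib(4096, 7168) == 112.0   # the "112.00 MiB" examples
assert randn_input_mib(2048, 5120) == 40.0    # the "40.00 MiB" examples
```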
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4127752Z 2025-05-07T20:32:47.4127872Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4127877Z 2025-05-07T20:32:47.4127985Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4128213Z self=, 2025-05-07T20:32:47.4128296Z T=2048, 2025-05-07T20:32:47.4128372Z D=7168, 2025-05-07T20:32:47.4128457Z scale_ub=1200.0, 2025-05-07T20:32:47.4128547Z contiguous=True, 2025-05-07T20:32:47.4128630Z compiled=True, 2025-05-07T20:32:47.4128703Z ) 2025-05-07T20:32:47.4128932Z self = 2025-05-07T20:32:47.4129109Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4129114Z 2025-05-07T20:32:47.4129191Z @given( 2025-05-07T20:32:47.4129315Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4129415Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4129539Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4129658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4129774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4129851Z ) 2025-05-07T20:32:47.4130104Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4130248Z def test_silu_mul_quant( 2025-05-07T20:32:47.4130330Z self, 2025-05-07T20:32:47.4130408Z T: int, 2025-05-07T20:32:47.4130484Z D: int, 2025-05-07T20:32:47.4130585Z scale_ub: Optional[float], 2025-05-07T20:32:47.4130675Z contiguous: bool, 2025-05-07T20:32:47.4130761Z compiled: bool, 2025-05-07T20:32:47.4130846Z ) -> None: 2025-05-07T20:32:47.4130942Z torch.manual_seed(2025) 2025-05-07T20:32:47.4131018Z 2025-05-07T20:32:47.4131191Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4131308Z 2025-05-07T20:32:47.4131403Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4131529Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4133386Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4133430Z 2025-05-07T20:32:47.4133552Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4133557Z 2025-05-07T20:32:47.4133661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4133895Z self=, 2025-05-07T20:32:47.4133973Z T=2048, 2025-05-07T20:32:47.4134049Z D=7168, 2025-05-07T20:32:47.4134134Z scale_ub=None, 2025-05-07T20:32:47.4134218Z contiguous=True, 2025-05-07T20:32:47.4134305Z compiled=False, 2025-05-07T20:32:47.4134377Z ) 2025-05-07T20:32:47.4134601Z self = 2025-05-07T20:32:47.4134787Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4134791Z 2025-05-07T20:32:47.4134869Z @given( 2025-05-07T20:32:47.4134989Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4135093Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4135208Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4135327Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4135445Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4135520Z ) 2025-05-07T20:32:47.4135781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4135876Z def test_silu_mul_quant( 2025-05-07T20:32:47.4135951Z self, 2025-05-07T20:32:47.4136031Z T: int, 2025-05-07T20:32:47.4136107Z D: int, 2025-05-07T20:32:47.4136207Z scale_ub: Optional[float], 2025-05-07T20:32:47.4136306Z contiguous: bool, 2025-05-07T20:32:47.4136395Z compiled: bool, 2025-05-07T20:32:47.4136475Z ) -> None: 2025-05-07T20:32:47.4136573Z torch.manual_seed(2025) 2025-05-07T20:32:47.4136646Z 2025-05-07T20:32:47.4136818Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4136895Z 2025-05-07T20:32:47.4136987Z > x_sign = torch.sign(x) 2025-05-07T20:32:47.4138809Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
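[editor's note] For orientation while reading the repeated test source: a rough eager-mode sketch of the computation `silu_mul_quant` appears to perform, SiLU of the first half, multiplied by the second half, rowwise-quantized to fp8. The e4m3 maximum of 448 and the rowwise scaling are assumptions for illustration, not FBGEMM's actual kernel semantics:

```python
from typing import Optional, Tuple
import torch

FP8_MAX = 448.0  # assumed: max magnitude representable in float8_e4m3fn

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Plain eager reference: compute in fp32, then quantize rowwise.
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / FP8_MAX
    y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)
```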
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4138818Z 2025-05-07T20:32:47.4138984Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:47.4138990Z 2025-05-07T20:32:47.4139098Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4139328Z self=, 2025-05-07T20:32:47.4139404Z T=1, 2025-05-07T20:32:47.4139485Z D=7168, 2025-05-07T20:32:47.4139567Z scale_ub=1200.0, 2025-05-07T20:32:47.4139651Z contiguous=True, 2025-05-07T20:32:47.4139739Z compiled=False, 2025-05-07T20:32:47.4139812Z ) 2025-05-07T20:32:47.4140038Z self = 2025-05-07T20:32:47.4140253Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4140258Z 2025-05-07T20:32:47.4143790Z @given( 2025-05-07T20:32:47.4143935Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4144039Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4144257Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4144420Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4144542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4144621Z ) 2025-05-07T20:32:47.4144877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4144974Z def test_silu_mul_quant( 2025-05-07T20:32:47.4145057Z self, 2025-05-07T20:32:47.4145135Z T: int, 2025-05-07T20:32:47.4145216Z D: int, 2025-05-07T20:32:47.4145317Z scale_ub: Optional[float], 2025-05-07T20:32:47.4145410Z contiguous: bool, 2025-05-07T20:32:47.4145502Z compiled: bool, 2025-05-07T20:32:47.4145582Z ) -> None: 2025-05-07T20:32:47.4145679Z torch.manual_seed(2025) 2025-05-07T20:32:47.4145755Z 2025-05-07T20:32:47.4145932Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4146010Z 2025-05-07T20:32:47.4146108Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4146240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4146330Z x = x_sign * x_clamp 2025-05-07T20:32:47.4146416Z x0 = x[:, :D] 2025-05-07T20:32:47.4146499Z x1 = x[:, D:] 2025-05-07T20:32:47.4146575Z 2025-05-07T20:32:47.4146661Z if contiguous: 2025-05-07T20:32:47.4146754Z x0 = x0.contiguous() 2025-05-07T20:32:47.4146848Z x1 = x1.contiguous() 2025-05-07T20:32:47.4146922Z 2025-05-07T20:32:47.4147015Z if scale_ub is not None: 2025-05-07T20:32:47.4147130Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4147274Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4147352Z ) 2025-05-07T20:32:47.4147432Z else: 2025-05-07T20:32:47.4147530Z scale_ub_tensor = None 2025-05-07T20:32:47.4147604Z 2025-05-07T20:32:47.4147740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4147837Z op = silu_mul_quant 2025-05-07T20:32:47.4147927Z if compiled: 2025-05-07T20:32:47.4148036Z op = torch.compile(op) 2025-05-07T20:32:47.4148145Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4148221Z 2025-05-07T20:32:47.4148317Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4148322Z 2025-05-07T20:32:47.4148422Z moe/activation_test.py:117: 2025-05-07T20:32:47.4148559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4148662Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4148765Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4149319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4149436Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4149820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4150104Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4150461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4150561Z kernel = self.compile( 2025-05-07T20:32:47.4150959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4151142Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4151276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4151321Z 2025-05-07T20:32:47.4151535Z self = 2025-05-07T20:32:47.4152345Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4152946Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb462a0>} 2025-05-07T20:32:47.4153721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4153920Z context = 2025-05-07T20:32:47.4153926Z 2025-05-07T20:32:47.4154097Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4154376Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4154486Z module_map=module_map) 2025-05-07T20:32:47.4154657Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4154767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4154848Z E ^ 2025-05-07T20:32:47.4155220Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4155225Z 2025-05-07T20:32:47.4155654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4155658Z 2025-05-07T20:32:47.4155764Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4155997Z self=, 2025-05-07T20:32:47.4156078Z T=128, 2025-05-07T20:32:47.4156162Z D=5120, 2025-05-07T20:32:47.4156245Z scale_ub=None, 2025-05-07T20:32:47.4156332Z contiguous=True, 2025-05-07T20:32:47.4156420Z compiled=False, 2025-05-07T20:32:47.4156494Z ) 2025-05-07T20:32:47.4156720Z self = 2025-05-07T20:32:47.4156905Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4156912Z 2025-05-07T20:32:47.4156990Z @given( 2025-05-07T20:32:47.4157113Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4157220Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4157339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4157462Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4157578Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4157652Z ) 2025-05-07T20:32:47.4157915Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4158014Z def test_silu_mul_quant( 2025-05-07T20:32:47.4158092Z self, 2025-05-07T20:32:47.4158173Z T: int, 2025-05-07T20:32:47.4158250Z D: int, 2025-05-07T20:32:47.4158352Z scale_ub: Optional[float], 2025-05-07T20:32:47.4158450Z contiguous: bool, 2025-05-07T20:32:47.4158540Z compiled: bool, 2025-05-07T20:32:47.4158665Z ) -> None: 2025-05-07T20:32:47.4158768Z torch.manual_seed(2025) 2025-05-07T20:32:47.4158841Z 2025-05-07T20:32:47.4159019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4159096Z 2025-05-07T20:32:47.4159189Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4159319Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4159409Z x = x_sign * x_clamp 2025-05-07T20:32:47.4159490Z x0 = x[:, :D] 2025-05-07T20:32:47.4159574Z x1 = x[:, D:] 2025-05-07T20:32:47.4159690Z 2025-05-07T20:32:47.4159776Z if contiguous: 2025-05-07T20:32:47.4159873Z x0 = x0.contiguous() 2025-05-07T20:32:47.4159964Z x1 = x1.contiguous() 2025-05-07T20:32:47.4160037Z 2025-05-07T20:32:47.4160227Z if scale_ub is not None: 2025-05-07T20:32:47.4160340Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4160529Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4160649Z ) 2025-05-07T20:32:47.4160727Z else: 2025-05-07T20:32:47.4160827Z scale_ub_tensor = None 2025-05-07T20:32:47.4160901Z 2025-05-07T20:32:47.4161037Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4161132Z op = silu_mul_quant 2025-05-07T20:32:47.4161217Z if compiled: 2025-05-07T20:32:47.4161319Z op = torch.compile(op) 2025-05-07T20:32:47.4161429Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4161504Z 2025-05-07T20:32:47.4161599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4161603Z 2025-05-07T20:32:47.4161706Z moe/activation_test.py:117: 2025-05-07T20:32:47.4161841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4161948Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4162049Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4162573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4162677Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4163049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4163284Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4163641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4163738Z kernel = self.compile( 2025-05-07T20:32:47.4164141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4164323Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4164453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4164462Z 2025-05-07T20:32:47.4164684Z self = 2025-05-07T20:32:47.4165489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4166015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07acb471a0>} 2025-05-07T20:32:47.4166785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4166987Z context = 2025-05-07T20:32:47.4166994Z 2025-05-07T20:32:47.4167165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4167486Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4167601Z module_map=module_map) 2025-05-07T20:32:47.4167770Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4167870Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4167952Z E ^ 2025-05-07T20:32:47.4168321Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4168326Z 2025-05-07T20:32:47.4168800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4168804Z 2025-05-07T20:32:47.4168910Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4169142Z self=, 2025-05-07T20:32:47.4169264Z T=128, 2025-05-07T20:32:47.4169341Z D=7168, 2025-05-07T20:32:47.4169429Z scale_ub=None, 2025-05-07T20:32:47.4169559Z contiguous=True, 2025-05-07T20:32:47.4169647Z compiled=False, 2025-05-07T20:32:47.4169722Z ) 2025-05-07T20:32:47.4169952Z self = 2025-05-07T20:32:47.4170128Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4170133Z 2025-05-07T20:32:47.4170214Z @given( 2025-05-07T20:32:47.4170337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4170438Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4170564Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4170684Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4170799Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4170876Z ) 2025-05-07T20:32:47.4171133Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4171238Z def test_silu_mul_quant( 2025-05-07T20:32:47.4171317Z self, 2025-05-07T20:32:47.4171395Z T: int, 2025-05-07T20:32:47.4171476Z D: int, 2025-05-07T20:32:47.4171576Z scale_ub: Optional[float], 2025-05-07T20:32:47.4171673Z contiguous: bool, 2025-05-07T20:32:47.4171762Z compiled: bool, 2025-05-07T20:32:47.4171841Z ) -> None: 2025-05-07T20:32:47.4171937Z torch.manual_seed(2025) 2025-05-07T20:32:47.4172014Z 2025-05-07T20:32:47.4172192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4172269Z 2025-05-07T20:32:47.4172365Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4172493Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4172585Z x = x_sign * x_clamp 2025-05-07T20:32:47.4172666Z x0 = x[:, :D] 2025-05-07T20:32:47.4172746Z x1 = x[:, D:] 2025-05-07T20:32:47.4172826Z 2025-05-07T20:32:47.4172911Z if contiguous: 2025-05-07T20:32:47.4173006Z x0 = x0.contiguous() 2025-05-07T20:32:47.4173104Z x1 = x1.contiguous() 2025-05-07T20:32:47.4173178Z 2025-05-07T20:32:47.4173271Z if scale_ub is not None: 2025-05-07T20:32:47.4173383Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4173523Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4173600Z ) 2025-05-07T20:32:47.4173680Z else: 2025-05-07T20:32:47.4173776Z scale_ub_tensor = None 2025-05-07T20:32:47.4173850Z 2025-05-07T20:32:47.4173989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4174083Z op = silu_mul_quant 2025-05-07T20:32:47.4174171Z if compiled: 2025-05-07T20:32:47.4174272Z op = torch.compile(op) 2025-05-07T20:32:47.4174380Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4174456Z 2025-05-07T20:32:47.4174551Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4174556Z 2025-05-07T20:32:47.4174701Z moe/activation_test.py:117: 2025-05-07T20:32:47.4174842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4174944Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4175047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4175569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4175668Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4176043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4176340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4176697Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4176837Z kernel = self.compile( 2025-05-07T20:32:47.4177274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4177466Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4177603Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4177607Z 2025-05-07T20:32:47.4177819Z self = 2025-05-07T20:32:47.4178632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4179185Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca58040>} 2025-05-07T20:32:47.4179987Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4180186Z context = 2025-05-07T20:32:47.4180191Z 2025-05-07T20:32:47.4180361Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4180638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4180747Z module_map=module_map) 2025-05-07T20:32:47.4180918Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4181021Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4181100Z E ^ 2025-05-07T20:32:47.4181470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4181474Z 2025-05-07T20:32:47.4181910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4181917Z 2025-05-07T20:32:47.4182027Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4182257Z self=, 2025-05-07T20:32:47.4182335Z T=2048, 2025-05-07T20:32:47.4182415Z D=7168, 2025-05-07T20:32:47.4182500Z scale_ub=1200.0, 2025-05-07T20:32:47.4182586Z contiguous=True, 2025-05-07T20:32:47.4182673Z compiled=False, 2025-05-07T20:32:47.4182748Z ) 2025-05-07T20:32:47.4182974Z self = 2025-05-07T20:32:47.4183160Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4183165Z 2025-05-07T20:32:47.4183243Z @given( 2025-05-07T20:32:47.4183370Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4183473Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4183639Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4183769Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4183889Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4183963Z ) 2025-05-07T20:32:47.4184222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4184318Z def test_silu_mul_quant( 2025-05-07T20:32:47.4184397Z self, 2025-05-07T20:32:47.4184478Z T: int, 2025-05-07T20:32:47.4184556Z D: int, 2025-05-07T20:32:47.4184658Z scale_ub: Optional[float], 2025-05-07T20:32:47.4184799Z contiguous: bool, 2025-05-07T20:32:47.4184886Z compiled: bool, 2025-05-07T20:32:47.4184971Z ) -> None: 2025-05-07T20:32:47.4185067Z torch.manual_seed(2025) 2025-05-07T20:32:47.4185140Z 2025-05-07T20:32:47.4185317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4187242Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
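[editor's note] This CompilationError, repeated across examples, looks like an architecture mismatch rather than flakiness: Triton's fp8e4nv (float8_e4m3fn) lowering generally needs compute capability 8.9 or newer, while linux.g5.4xlarge runners carry A10G GPUs at sm_86, where Triton only offers fp8e4b15 and fp8e5. A hedged sketch of a capability-based skip the test could use (the marker name is illustrative):

```python
import pytest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv generally requires sm_89+ (Ada/Hopper); the A10G on this
    # runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support (needs sm_89+)"
)
```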
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4187249Z 2025-05-07T20:32:47.4187375Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4187381Z 2025-05-07T20:32:47.4187488Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4187719Z self=, 2025-05-07T20:32:47.4187800Z T=1, 2025-05-07T20:32:47.4187879Z D=5120, 2025-05-07T20:32:47.4187964Z scale_ub=1200.0, 2025-05-07T20:32:47.4188055Z contiguous=True, 2025-05-07T20:32:47.4188142Z compiled=False, 2025-05-07T20:32:47.4188221Z ) 2025-05-07T20:32:47.4188448Z self = 2025-05-07T20:32:47.4188621Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4188626Z 2025-05-07T20:32:47.4188705Z @given( 2025-05-07T20:32:47.4188829Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4188930Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4189050Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4189172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4189308Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4189397Z ) 2025-05-07T20:32:47.4189676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4189775Z def test_silu_mul_quant( 2025-05-07T20:32:47.4189859Z self, 2025-05-07T20:32:47.4189937Z T: int, 2025-05-07T20:32:47.4190018Z D: int, 2025-05-07T20:32:47.4190124Z scale_ub: Optional[float], 2025-05-07T20:32:47.4190217Z contiguous: bool, 2025-05-07T20:32:47.4190307Z compiled: bool, 2025-05-07T20:32:47.4190386Z ) -> None: 2025-05-07T20:32:47.4190482Z torch.manual_seed(2025) 2025-05-07T20:32:47.4190561Z 2025-05-07T20:32:47.4190736Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4190811Z 2025-05-07T20:32:47.4190910Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4191039Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4191135Z x = x_sign * x_clamp 2025-05-07T20:32:47.4191217Z x0 = x[:, :D] 2025-05-07T20:32:47.4191297Z x1 = x[:, D:] 2025-05-07T20:32:47.4191374Z 2025-05-07T20:32:47.4191463Z if contiguous: 2025-05-07T20:32:47.4191556Z x0 = x0.contiguous() 2025-05-07T20:32:47.4191654Z x1 = x1.contiguous() 2025-05-07T20:32:47.4191729Z 2025-05-07T20:32:47.4191872Z if scale_ub is not None: 2025-05-07T20:32:47.4191986Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4192127Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4192204Z ) 2025-05-07T20:32:47.4192284Z else: 2025-05-07T20:32:47.4192379Z scale_ub_tensor = None 2025-05-07T20:32:47.4192457Z 2025-05-07T20:32:47.4192592Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4192684Z op = silu_mul_quant 2025-05-07T20:32:47.4192775Z if compiled: 2025-05-07T20:32:47.4193208Z op = torch.compile(op) 2025-05-07T20:32:47.4193318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4193393Z 2025-05-07T20:32:47.4193487Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4193492Z 2025-05-07T20:32:47.4193595Z moe/activation_test.py:117: 2025-05-07T20:32:47.4193772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4193913Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4194020Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4194537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4194639Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4195012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4195243Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4195605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4195701Z kernel = self.compile( 2025-05-07T20:32:47.4196097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4196287Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4196419Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4196423Z 2025-05-07T20:32:47.4196642Z self = 2025-05-07T20:32:47.4197443Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4197967Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07aca59580>} 2025-05-07T20:32:47.4198743Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4198950Z context = 2025-05-07T20:32:47.4198955Z 2025-05-07T20:32:47.4199130Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4199404Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4199517Z module_map=module_map) 2025-05-07T20:32:47.4199683Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4199784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4199868Z E ^ 2025-05-07T20:32:47.4200290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4200295Z 2025-05-07T20:32:47.4200728Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4200735Z 2025-05-07T20:32:47.4200893Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4201129Z self=, 2025-05-07T20:32:47.4201211Z T=2048, 2025-05-07T20:32:47.4201288Z D=5120, 2025-05-07T20:32:47.4201372Z scale_ub=None, 2025-05-07T20:32:47.4201463Z contiguous=True, 2025-05-07T20:32:47.4201549Z compiled=False, 2025-05-07T20:32:47.4201624Z ) 2025-05-07T20:32:47.4201852Z self = 2025-05-07T20:32:47.4202033Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4202080Z 2025-05-07T20:32:47.4202161Z @given( 2025-05-07T20:32:47.4202286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4202388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4202510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4202673Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4202794Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4202910Z ) 2025-05-07T20:32:47.4203165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4203261Z def test_silu_mul_quant( 2025-05-07T20:32:47.4203341Z self, 2025-05-07T20:32:47.4203418Z T: int, 2025-05-07T20:32:47.4203496Z D: int, 2025-05-07T20:32:47.4203598Z scale_ub: Optional[float], 2025-05-07T20:32:47.4203690Z contiguous: bool, 2025-05-07T20:32:47.4203777Z compiled: bool, 2025-05-07T20:32:47.4203861Z ) -> None: 2025-05-07T20:32:47.4203961Z torch.manual_seed(2025) 2025-05-07T20:32:47.4204038Z 2025-05-07T20:32:47.4204215Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4204299Z 2025-05-07T20:32:47.4204393Z > x_sign = torch.sign(x) 2025-05-07T20:32:47.4206232Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4206245Z 2025-05-07T20:32:47.4206366Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:47.4206373Z 2025-05-07T20:32:47.4206477Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4206712Z self=, 2025-05-07T20:32:47.4206791Z T=16384, 2025-05-07T20:32:47.4206869Z D=5120, 2025-05-07T20:32:47.4206956Z scale_ub=None, 2025-05-07T20:32:47.4207045Z contiguous=True, 2025-05-07T20:32:47.4207133Z compiled=False, 2025-05-07T20:32:47.4207210Z ) 2025-05-07T20:32:47.4207437Z self = 2025-05-07T20:32:47.4207622Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4207627Z 2025-05-07T20:32:47.4207704Z @given( 2025-05-07T20:32:47.4207826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4207929Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4208046Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4208165Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4208289Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4208363Z ) 2025-05-07T20:32:47.4208620Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4208718Z def test_silu_mul_quant( 2025-05-07T20:32:47.4208795Z self, 2025-05-07T20:32:47.4208877Z T: int, 2025-05-07T20:32:47.4208953Z D: int, 2025-05-07T20:32:47.4209202Z scale_ub: Optional[float], 2025-05-07T20:32:47.4209299Z contiguous: bool, 2025-05-07T20:32:47.4209386Z compiled: bool, 2025-05-07T20:32:47.4209465Z ) -> None: 2025-05-07T20:32:47.4209562Z torch.manual_seed(2025) 2025-05-07T20:32:47.4209635Z 2025-05-07T20:32:47.4209809Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4211642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
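[editor's note] The OOMs only begin once many examples have already run, and the reported free memory keeps shrinking, which points at state accumulating across Hypothesis examples; compiled-graph caches from torch.compile and cached allocator blocks are likely suspects, though that is an inference from this log, not a confirmed diagnosis. A sketch of scrubbing that state between examples:

```python
import gc
import torch

def reset_cuda_between_examples() -> None:
    torch._dynamo.reset()      # drop torch.compile graph caches
    gc.collect()               # release Python references to dead tensors
    torch.cuda.empty_cache()   # hand cached allocator blocks back to the driver
    torch.cuda.synchronize()
```

Calling this at the top of the test body (or from a function-scoped fixture) would keep each example's footprint independent of whatever Hypothesis tried before it.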
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4211727Z 2025-05-07T20:32:47.4211887Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4211896Z 2025-05-07T20:32:47.4212003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4212235Z self=, 2025-05-07T20:32:47.4212316Z T=4096, 2025-05-07T20:32:47.4212392Z D=5120, 2025-05-07T20:32:47.4212475Z scale_ub=None, 2025-05-07T20:32:47.4212563Z contiguous=True, 2025-05-07T20:32:47.4212648Z compiled=False, 2025-05-07T20:32:47.4212722Z ) 2025-05-07T20:32:47.4212951Z self = 2025-05-07T20:32:47.4213130Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4213135Z 2025-05-07T20:32:47.4213217Z @given( 2025-05-07T20:32:47.4213579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4213738Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4213883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4214009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4214127Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4214205Z ) 2025-05-07T20:32:47.4214459Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4214555Z def test_silu_mul_quant( 2025-05-07T20:32:47.4214636Z self, 2025-05-07T20:32:47.4214714Z T: int, 2025-05-07T20:32:47.4214790Z D: int, 2025-05-07T20:32:47.4214893Z scale_ub: Optional[float], 2025-05-07T20:32:47.4214986Z contiguous: bool, 2025-05-07T20:32:47.4215079Z compiled: bool, 2025-05-07T20:32:47.4215159Z ) -> None: 2025-05-07T20:32:47.4215256Z torch.manual_seed(2025) 2025-05-07T20:32:47.4215332Z 2025-05-07T20:32:47.4215506Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4217338Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4217347Z 2025-05-07T20:32:47.4217472Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4217477Z 2025-05-07T20:32:47.4217582Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4217818Z self=, 2025-05-07T20:32:47.4217899Z T=2048, 2025-05-07T20:32:47.4217978Z D=5120, 2025-05-07T20:32:47.4218068Z scale_ub=None, 2025-05-07T20:32:47.4218155Z contiguous=False, 2025-05-07T20:32:47.4218349Z compiled=False, 2025-05-07T20:32:47.4218429Z ) 2025-05-07T20:32:47.4218655Z self = 2025-05-07T20:32:47.4218838Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4218843Z 2025-05-07T20:32:47.4218921Z @given( 2025-05-07T20:32:47.4219043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4219147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4219264Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4219444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4219566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4219642Z ) 2025-05-07T20:32:47.4219901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4220054Z def test_silu_mul_quant( 2025-05-07T20:32:47.4220132Z self, 2025-05-07T20:32:47.4220216Z T: int, 2025-05-07T20:32:47.4220348Z D: int, 2025-05-07T20:32:47.4220449Z scale_ub: Optional[float], 2025-05-07T20:32:47.4220544Z contiguous: bool, 2025-05-07T20:32:47.4220631Z compiled: bool, 2025-05-07T20:32:47.4220710Z ) -> None: 2025-05-07T20:32:47.4220810Z torch.manual_seed(2025) 2025-05-07T20:32:47.4220885Z 2025-05-07T20:32:47.4221057Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4222884Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
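[editor's note] Another option is to let the property itself refuse shapes that cannot fit: on this 22 GiB card the largest sampled inputs leave little headroom once anything else is resident. A sketch using hypothesis.assume, with a deliberately rough safety factor:

```python
import torch
from hypothesis import assume

def fits_in_free_memory(T: int, D: int, safety_factor: float = 4.0) -> bool:
    # The [T, 2*D] bf16 input plus the sign/clamp temporaries; the factor
    # is a rough allowance, not a measured bound.
    needed = T * 2 * D * 2 * safety_factor
    free, _total = torch.cuda.mem_get_info()
    return needed < free

# Inside test_silu_mul_quant, before the first allocation:
#     assume(fits_in_free_memory(T, D))
```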
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4222894Z 2025-05-07T20:32:47.4223016Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4223024Z 2025-05-07T20:32:47.4223128Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4223359Z self=, 2025-05-07T20:32:47.4223441Z T=4096, 2025-05-07T20:32:47.4223521Z D=7168, 2025-05-07T20:32:47.4223605Z scale_ub=None, 2025-05-07T20:32:47.4223696Z contiguous=True, 2025-05-07T20:32:47.4223781Z compiled=True, 2025-05-07T20:32:47.4223860Z ) 2025-05-07T20:32:47.4224086Z self = 2025-05-07T20:32:47.4224262Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4224267Z 2025-05-07T20:32:47.4224347Z @given( 2025-05-07T20:32:47.4224470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4224573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4224696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4224816Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4224932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4225012Z ) 2025-05-07T20:32:47.4225265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4225362Z def test_silu_mul_quant( 2025-05-07T20:32:47.4225445Z self, 2025-05-07T20:32:47.4225522Z T: int, 2025-05-07T20:32:47.4225604Z D: int, 2025-05-07T20:32:47.4225708Z scale_ub: Optional[float], 2025-05-07T20:32:47.4225798Z contiguous: bool, 2025-05-07T20:32:47.4225888Z compiled: bool, 2025-05-07T20:32:47.4225968Z ) -> None: 2025-05-07T20:32:47.4226065Z torch.manual_seed(2025) 2025-05-07T20:32:47.4226141Z 2025-05-07T20:32:47.4226365Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4228304Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4228349Z 2025-05-07T20:32:47.4228471Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4228476Z 2025-05-07T20:32:47.4228582Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4228817Z self=, 2025-05-07T20:32:47.4228934Z T=2048, 2025-05-07T20:32:47.4229014Z D=5120, 2025-05-07T20:32:47.4229102Z scale_ub=1200.0, 2025-05-07T20:32:47.4229228Z contiguous=False, 2025-05-07T20:32:47.4229319Z compiled=False, 2025-05-07T20:32:47.4229392Z ) 2025-05-07T20:32:47.4229617Z self = 2025-05-07T20:32:47.4229803Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4229807Z 2025-05-07T20:32:47.4229884Z @given( 2025-05-07T20:32:47.4230004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4230107Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4230226Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4230350Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4230468Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4230544Z ) 2025-05-07T20:32:47.4230800Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4230904Z def test_silu_mul_quant( 2025-05-07T20:32:47.4230987Z self, 2025-05-07T20:32:47.4231069Z T: int, 2025-05-07T20:32:47.4231147Z D: int, 2025-05-07T20:32:47.4231247Z scale_ub: Optional[float], 2025-05-07T20:32:47.4231341Z contiguous: bool, 2025-05-07T20:32:47.4231429Z compiled: bool, 2025-05-07T20:32:47.4231509Z ) -> None: 2025-05-07T20:32:47.4231609Z torch.manual_seed(2025) 2025-05-07T20:32:47.4231683Z 2025-05-07T20:32:47.4231861Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4233689Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
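[editor's note] Because failures here depend on what ran before, reproducing a single parameter set outside Hypothesis separates "kernel bug" from "GPU already full". A sketch, with the import path and the (x0, x1, scale_ub_tensor) signature read off the traceback and test body above:

```python
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

def run_one(T: int = 2048, D: int = 5120) -> None:
    torch.manual_seed(2025)
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    y_fp8, y_scale = silu_mul_quant(x[:, :D].contiguous(), x[:, D:].contiguous(), None)
    print(y_fp8.shape, y_scale.shape)

run_one()
```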
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4233698Z 2025-05-07T20:32:47.4233822Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4233827Z 2025-05-07T20:32:47.4233933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4234164Z self=, 2025-05-07T20:32:47.4234245Z T=4096, 2025-05-07T20:32:47.4234323Z D=7168, 2025-05-07T20:32:47.4234408Z scale_ub=1200.0, 2025-05-07T20:32:47.4234502Z contiguous=True, 2025-05-07T20:32:47.4234589Z compiled=False, 2025-05-07T20:32:47.4234664Z ) 2025-05-07T20:32:47.4234893Z self = 2025-05-07T20:32:47.4235073Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4235080Z 2025-05-07T20:32:47.4235161Z @given( 2025-05-07T20:32:47.4235331Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4235436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4235556Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4235675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4235791Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4235868Z ) 2025-05-07T20:32:47.4236122Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4236217Z def test_silu_mul_quant( 2025-05-07T20:32:47.4236339Z self, 2025-05-07T20:32:47.4236418Z T: int, 2025-05-07T20:32:47.4236497Z D: int, 2025-05-07T20:32:47.4236598Z scale_ub: Optional[float], 2025-05-07T20:32:47.4236689Z contiguous: bool, 2025-05-07T20:32:47.4236779Z compiled: bool, 2025-05-07T20:32:47.4236858Z ) -> None: 2025-05-07T20:32:47.4236999Z torch.manual_seed(2025) 2025-05-07T20:32:47.4237075Z 2025-05-07T20:32:47.4237312Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4239151Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4239159Z 2025-05-07T20:32:47.4239279Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4239283Z 2025-05-07T20:32:47.4239388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4239625Z self=, 2025-05-07T20:32:47.4239707Z T=16384, 2025-05-07T20:32:47.4239790Z D=7168, 2025-05-07T20:32:47.4239875Z scale_ub=None, 2025-05-07T20:32:47.4239964Z contiguous=False, 2025-05-07T20:32:47.4240052Z compiled=True, 2025-05-07T20:32:47.4240247Z ) 2025-05-07T20:32:47.4240474Z self = 2025-05-07T20:32:47.4240658Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:47.4240663Z 2025-05-07T20:32:47.4240741Z @given( 2025-05-07T20:32:47.4240863Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4240972Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4241089Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4241214Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4241331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4241412Z ) 2025-05-07T20:32:47.4241672Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4241772Z def test_silu_mul_quant( 2025-05-07T20:32:47.4241850Z self, 2025-05-07T20:32:47.4241931Z T: int, 2025-05-07T20:32:47.4242009Z D: int, 2025-05-07T20:32:47.4242108Z scale_ub: Optional[float], 2025-05-07T20:32:47.4242205Z contiguous: bool, 2025-05-07T20:32:47.4242292Z compiled: bool, 2025-05-07T20:32:47.4242371Z ) -> None: 2025-05-07T20:32:47.4242469Z torch.manual_seed(2025) 2025-05-07T20:32:47.4242544Z 2025-05-07T20:32:47.4242722Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4244600Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4244608Z 2025-05-07T20:32:47.4244737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4244741Z 2025-05-07T20:32:47.4244845Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4245075Z self=, 2025-05-07T20:32:47.4245159Z T=4096, 2025-05-07T20:32:47.4245280Z D=7168, 2025-05-07T20:32:47.4245364Z scale_ub=None, 2025-05-07T20:32:47.4245452Z contiguous=True, 2025-05-07T20:32:47.4245536Z compiled=False, 2025-05-07T20:32:47.4245611Z ) 2025-05-07T20:32:47.4245839Z self = 2025-05-07T20:32:47.4246057Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4246064Z 2025-05-07T20:32:47.4246146Z @given( 2025-05-07T20:32:47.4246305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4246407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4246526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4246646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4246761Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4246839Z ) 2025-05-07T20:32:47.4247093Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4247191Z def test_silu_mul_quant( 2025-05-07T20:32:47.4247273Z self, 2025-05-07T20:32:47.4247350Z T: int, 2025-05-07T20:32:47.4247432Z D: int, 2025-05-07T20:32:47.4247533Z scale_ub: Optional[float], 2025-05-07T20:32:47.4247625Z contiguous: bool, 2025-05-07T20:32:47.4247718Z compiled: bool, 2025-05-07T20:32:47.4247796Z ) -> None: 2025-05-07T20:32:47.4247894Z torch.manual_seed(2025) 2025-05-07T20:32:47.4247977Z 2025-05-07T20:32:47.4248152Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4249982Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
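Each "Trying example" block in this log is Hypothesis walking the parameter grid declared with st.sampled_from: 5 values of T x 2 of D x 2 of scale_ub x 2 of contiguous x 2 of compiled = 80 combinations, from which max_examples are drawn. A self-contained sketch of the same pattern (max_examples=16 is illustrative; the value of _MAX_SAMPLES in the real test does not appear in this log):

    from typing import Optional

    from hypothesis import Verbosity, given, settings
    from hypothesis import strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=16, deadline=None)
    def test_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
        # With the 'ci' profile's derandomize=True (see the session header
        # in the retried run later in this log), the drawn examples are
        # deterministic across runs.
        assert T > 0 and D > 0

    test_grid()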
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4249990Z 2025-05-07T20:32:47.4250112Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4250116Z 2025-05-07T20:32:47.4250224Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4250461Z self=, 2025-05-07T20:32:47.4250541Z T=16384, 2025-05-07T20:32:47.4250622Z D=7168, 2025-05-07T20:32:47.4250705Z scale_ub=None, 2025-05-07T20:32:47.4250790Z contiguous=True, 2025-05-07T20:32:47.4250877Z compiled=False, 2025-05-07T20:32:47.4250951Z ) 2025-05-07T20:32:47.4251176Z self = 2025-05-07T20:32:47.4251362Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:47.4251366Z 2025-05-07T20:32:47.4251446Z @given( 2025-05-07T20:32:47.4251566Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4251669Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4251785Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4251909Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4252028Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4252104Z ) 2025-05-07T20:32:47.4252412Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4252510Z def test_silu_mul_quant( 2025-05-07T20:32:47.4252588Z self, 2025-05-07T20:32:47.4252669Z T: int, 2025-05-07T20:32:47.4252746Z D: int, 2025-05-07T20:32:47.4252845Z scale_ub: Optional[float], 2025-05-07T20:32:47.4252939Z contiguous: bool, 2025-05-07T20:32:47.4253026Z compiled: bool, 2025-05-07T20:32:47.4253105Z ) -> None: 2025-05-07T20:32:47.4253205Z torch.manual_seed(2025) 2025-05-07T20:32:47.4253320Z 2025-05-07T20:32:47.4253497Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4255362Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4255404Z 2025-05-07T20:32:47.4255532Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4255536Z 2025-05-07T20:32:47.4255642Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4255872Z self=, 2025-05-07T20:32:47.4255956Z T=16384, 2025-05-07T20:32:47.4256035Z D=7168, 2025-05-07T20:32:47.4256121Z scale_ub=1200.0, 2025-05-07T20:32:47.4256210Z contiguous=True, 2025-05-07T20:32:47.4256296Z compiled=False, 2025-05-07T20:32:47.4256371Z ) 2025-05-07T20:32:47.4256596Z self = 2025-05-07T20:32:47.4256786Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4256791Z 2025-05-07T20:32:47.4256872Z @given( 2025-05-07T20:32:47.4256994Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4257094Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4257214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4257333Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4257450Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4257527Z ) 2025-05-07T20:32:47.4257785Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4257881Z def test_silu_mul_quant( 2025-05-07T20:32:47.4257961Z self, 2025-05-07T20:32:47.4258038Z T: int, 2025-05-07T20:32:47.4258120Z D: int, 2025-05-07T20:32:47.4258221Z scale_ub: Optional[float], 2025-05-07T20:32:47.4258314Z contiguous: bool, 2025-05-07T20:32:47.4258408Z compiled: bool, 2025-05-07T20:32:47.4258489Z ) -> None: 2025-05-07T20:32:47.4258585Z torch.manual_seed(2025) 2025-05-07T20:32:47.4258663Z 2025-05-07T20:32:47.4258840Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4260669Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
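The "Tried to allocate" figures match the input shape exactly: x is [T, 2*D] in bfloat16 (2 bytes per element), so the first allocation needs T * 2D * 2 bytes — 112.00 MiB for T=4096, D=7168 and 448.00 MiB for T=16384 above (and 56.00 MiB for T=2048 further down). A quick check:

    # bf16 tensor of shape [T, 2*D]: T * (2*D) * 2 bytes.
    D = 7168
    for T in (2048, 4096, 16384):
        mib = T * (2 * D) * 2 / 2**20
        print(f"T={T:5d}: {mib:.2f} MiB")  # 56.00, 112.00, 448.00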
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4260678Z 2025-05-07T20:32:47.4260797Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4260804Z 2025-05-07T20:32:47.4260955Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4261193Z self=, 2025-05-07T20:32:47.4261273Z T=128, 2025-05-07T20:32:47.4261354Z D=5120, 2025-05-07T20:32:47.4261439Z scale_ub=1200.0, 2025-05-07T20:32:47.4261526Z contiguous=False, 2025-05-07T20:32:47.4261616Z compiled=False, 2025-05-07T20:32:47.4261690Z ) 2025-05-07T20:32:47.4261914Z self = 2025-05-07T20:32:47.4262096Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:47.4262141Z 2025-05-07T20:32:47.4262220Z @given( 2025-05-07T20:32:47.4262340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4262445Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4262561Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4262724Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4262843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4262955Z ) 2025-05-07T20:32:47.4263214Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4263309Z def test_silu_mul_quant( 2025-05-07T20:32:47.4263386Z self, 2025-05-07T20:32:47.4263465Z T: int, 2025-05-07T20:32:47.4263542Z D: int, 2025-05-07T20:32:47.4263641Z scale_ub: Optional[float], 2025-05-07T20:32:47.4263734Z contiguous: bool, 2025-05-07T20:32:47.4263821Z compiled: bool, 2025-05-07T20:32:47.4263902Z ) -> None: 2025-05-07T20:32:47.4264001Z torch.manual_seed(2025) 2025-05-07T20:32:47.4264074Z 2025-05-07T20:32:47.4264252Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4264327Z 2025-05-07T20:32:47.4264423Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4264560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4264653Z x = x_sign * x_clamp 2025-05-07T20:32:47.4264738Z x0 = x[:, :D] 2025-05-07T20:32:47.4264821Z x1 = x[:, D:] 2025-05-07T20:32:47.4264894Z 2025-05-07T20:32:47.4264978Z if contiguous: 2025-05-07T20:32:47.4265075Z x0 = x0.contiguous() 2025-05-07T20:32:47.4265168Z x1 = x1.contiguous() 2025-05-07T20:32:47.4265241Z 2025-05-07T20:32:47.4265337Z if scale_ub is not None: 2025-05-07T20:32:47.4265446Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4265585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4265667Z ) 2025-05-07T20:32:47.4265744Z else: 2025-05-07T20:32:47.4265843Z scale_ub_tensor = None 2025-05-07T20:32:47.4265916Z 2025-05-07T20:32:47.4266049Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4266143Z op = silu_mul_quant 2025-05-07T20:32:47.4266232Z if compiled: 2025-05-07T20:32:47.4266335Z op = torch.compile(op) 2025-05-07T20:32:47.4266449Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4266522Z 2025-05-07T20:32:47.4266617Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4266621Z 2025-05-07T20:32:47.4266722Z moe/activation_test.py:117: 2025-05-07T20:32:47.4266854Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4266961Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4267067Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4267585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4267690Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4271542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4271803Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4272240Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4272345Z kernel = self.compile( 2025-05-07T20:32:47.4272747Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4272929Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4273065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4273070Z 2025-05-07T20:32:47.4273283Z self = 2025-05-07T20:32:47.4274136Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4274775Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ac7b11c0>} 2025-05-07T20:32:47.4275543Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4275745Z context = 2025-05-07T20:32:47.4275750Z 2025-05-07T20:32:47.4275923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4276203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4276314Z module_map=module_map) 2025-05-07T20:32:47.4276481Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4276585Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4276666Z E ^ 2025-05-07T20:32:47.4277042Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4277050Z 2025-05-07T20:32:47.4277481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4277485Z 2025-05-07T20:32:47.4277592Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4277826Z self=, 2025-05-07T20:32:47.4277906Z T=2048, 2025-05-07T20:32:47.4277982Z D=7168, 2025-05-07T20:32:47.4278072Z scale_ub=None, 2025-05-07T20:32:47.4278160Z contiguous=False, 2025-05-07T20:32:47.4278245Z compiled=False, 2025-05-07T20:32:47.4278324Z ) 2025-05-07T20:32:47.4278550Z self = 2025-05-07T20:32:47.4278735Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:47.4278742Z 2025-05-07T20:32:47.4278824Z @given( 2025-05-07T20:32:47.4278950Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4279056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4279176Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4279296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4279414Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4279492Z ) 2025-05-07T20:32:47.4279749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4279845Z def test_silu_mul_quant( 2025-05-07T20:32:47.4279925Z self, 2025-05-07T20:32:47.4280005Z T: int, 2025-05-07T20:32:47.4280163Z D: int, 2025-05-07T20:32:47.4280266Z scale_ub: Optional[float], 2025-05-07T20:32:47.4280359Z contiguous: bool, 2025-05-07T20:32:47.4280446Z compiled: bool, 2025-05-07T20:32:47.4280529Z ) -> None: 2025-05-07T20:32:47.4280628Z torch.manual_seed(2025) 2025-05-07T20:32:47.4280750Z 2025-05-07T20:32:47.4280933Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4282775Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
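The CompilationError above is a different failure mode from the OOMs: Triton rejects the fp8e4nv (FP8 E4M3) element type while lowering the kernel. This runner's g5.4xlarge carries an A10G, a compute capability 8.6 part, and Triton only lowers fp8e4nv on sm_89 and newer (Ada/Hopper); older GPUs are limited to the fp8e4b15 and fp8e5 encodings named in the message. One possible guard, not part of the test file, that would skip rather than fail on such hardware:

    import torch

    def fp8e4nv_supported() -> bool:
        # fp8e4nv (e4m3) lowering in Triton requires sm_89+ (Ada/Hopper).
        # The A10G on this runner reports (8, 6), hence the ValueError.
        return torch.cuda.get_device_capability() >= (8, 9)

    # e.g. in the test:  if not fp8e4nv_supported(): pytest.skip("needs sm_89+")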
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4282822Z 2025-05-07T20:32:47.4282944Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4282949Z 2025-05-07T20:32:47.4283058Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4283332Z self=, 2025-05-07T20:32:47.4283450Z T=128, 2025-05-07T20:32:47.4283530Z D=7168, 2025-05-07T20:32:47.4283616Z scale_ub=1200.0, 2025-05-07T20:32:47.4283709Z contiguous=True, 2025-05-07T20:32:47.4283794Z compiled=True, 2025-05-07T20:32:47.4283869Z ) 2025-05-07T20:32:47.4284099Z self = 2025-05-07T20:32:47.4284273Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4284278Z 2025-05-07T20:32:47.4284356Z @given( 2025-05-07T20:32:47.4284480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4284585Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4284707Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4284829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4284948Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4285030Z ) 2025-05-07T20:32:47.4285288Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4285385Z def test_silu_mul_quant( 2025-05-07T20:32:47.4285467Z self, 2025-05-07T20:32:47.4285545Z T: int, 2025-05-07T20:32:47.4285623Z D: int, 2025-05-07T20:32:47.4285727Z scale_ub: Optional[float], 2025-05-07T20:32:47.4285818Z contiguous: bool, 2025-05-07T20:32:47.4285905Z compiled: bool, 2025-05-07T20:32:47.4285989Z ) -> None: 2025-05-07T20:32:47.4286085Z torch.manual_seed(2025) 2025-05-07T20:32:47.4286161Z 2025-05-07T20:32:47.4286339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4286414Z 2025-05-07T20:32:47.4286514Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4286642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4286734Z x = x_sign * x_clamp 2025-05-07T20:32:47.4286822Z x0 = x[:, :D] 2025-05-07T20:32:47.4286904Z x1 = x[:, D:] 2025-05-07T20:32:47.4286982Z 2025-05-07T20:32:47.4287074Z if contiguous: 2025-05-07T20:32:47.4287169Z x0 = x0.contiguous() 2025-05-07T20:32:47.4287260Z x1 = x1.contiguous() 2025-05-07T20:32:47.4287339Z 2025-05-07T20:32:47.4287431Z if scale_ub is not None: 2025-05-07T20:32:47.4287544Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:47.4287684Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:47.4287760Z ) 2025-05-07T20:32:47.4287840Z else: 2025-05-07T20:32:47.4287936Z scale_ub_tensor = None 2025-05-07T20:32:47.4288013Z 2025-05-07T20:32:47.4288149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:47.4288240Z op = silu_mul_quant 2025-05-07T20:32:47.4288327Z if compiled: 2025-05-07T20:32:47.4288433Z op = torch.compile(op) 2025-05-07T20:32:47.4288545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4288618Z 2025-05-07T20:32:47.4288766Z > y_fp8, y_scale = fn() 2025-05-07T20:32:47.4288771Z 2025-05-07T20:32:47.4288872Z moe/activation_test.py:117: 2025-05-07T20:32:47.4289008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4289111Z moe/activation_test.py:115: in fn 2025-05-07T20:32:47.4289213Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:47.4289599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:47.4289694Z return fn(*args, **kwargs) 
2025-05-07T20:32:47.4290244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:47.4290348Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:47.4290717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:47.4290998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:47.4291388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:47.4291487Z kernel = self.compile( 2025-05-07T20:32:47.4291885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:47.4292067Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:47.4292198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:47.4292207Z 2025-05-07T20:32:47.4292421Z self = 2025-05-07T20:32:47.4293224Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:47.4293759Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f07ac8a3b00>} 2025-05-07T20:32:47.4294529Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:47.4294733Z context = 2025-05-07T20:32:47.4294738Z 2025-05-07T20:32:47.4294913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:47.4295189Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:47.4295303Z module_map=module_map) 2025-05-07T20:32:47.4295470Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:47.4295578Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:47.4295657Z E ^ 2025-05-07T20:32:47.4296026Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:47.4296030Z 2025-05-07T20:32:47.4296464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:47.4296468Z 2025-05-07T20:32:47.4296574Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4296805Z self=, 2025-05-07T20:32:47.4296891Z T=128, 2025-05-07T20:32:47.4296967Z D=7168, 2025-05-07T20:32:47.4297055Z scale_ub=1200.0, 2025-05-07T20:32:47.4297141Z contiguous=True, 2025-05-07T20:32:47.4297225Z compiled=False, 2025-05-07T20:32:47.4297306Z ) 2025-05-07T20:32:47.4297532Z self = 2025-05-07T20:32:47.4297758Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:47.4297762Z 2025-05-07T20:32:47.4297850Z @given( 2025-05-07T20:32:47.4297973Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4298075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4298196Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4298316Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4298435Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4298510Z ) 2025-05-07T20:32:47.4298767Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4298917Z def test_silu_mul_quant( 2025-05-07T20:32:47.4299016Z self, 2025-05-07T20:32:47.4299098Z T: int, 2025-05-07T20:32:47.4299200Z D: int, 2025-05-07T20:32:47.4299302Z scale_ub: Optional[float], 2025-05-07T20:32:47.4299393Z contiguous: bool, 2025-05-07T20:32:47.4299525Z compiled: bool, 2025-05-07T20:32:47.4299607Z ) -> None: 2025-05-07T20:32:47.4299742Z torch.manual_seed(2025) 2025-05-07T20:32:47.4299819Z 2025-05-07T20:32:47.4299996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4300075Z 2025-05-07T20:32:47.4300169Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4300298Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4302146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
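For reference, the op under test, silu_mul_quant, fuses a SiLU-gated multiply with row-wise FP8 quantization; the test's ref_fn (visible in the retried session later in this log) spells out the unquantized part in fp32 as x0 * sigmoid(x0) * x1. The elementwise reference in plain PyTorch, as a sketch (this is not FBGEMM's Triton kernel):

    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # silu(x) = x * sigmoid(x); the fused kernel _fbgemm_silu_mul_quant
        # additionally quantizes each output row to fp8 with a per-row scale.
        x0, x1 = x0.float(), x1.float()
        return x0 * torch.sigmoid(x0) * x1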
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4302157Z 2025-05-07T20:32:47.4302287Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4302292Z 2025-05-07T20:32:47.4302401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4302635Z self=, 2025-05-07T20:32:47.4302713Z T=128, 2025-05-07T20:32:47.4302793Z D=5120, 2025-05-07T20:32:47.4302878Z scale_ub=1200.0, 2025-05-07T20:32:47.4302964Z contiguous=True, 2025-05-07T20:32:47.4303051Z compiled=True, 2025-05-07T20:32:47.4303123Z ) 2025-05-07T20:32:47.4303353Z self = 2025-05-07T20:32:47.4303530Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:47.4303535Z 2025-05-07T20:32:47.4303614Z @given( 2025-05-07T20:32:47.4303738Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4303841Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4303965Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4304089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4304205Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4304280Z ) 2025-05-07T20:32:47.4304536Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4304633Z def test_silu_mul_quant( 2025-05-07T20:32:47.4304716Z self, 2025-05-07T20:32:47.4304793Z T: int, 2025-05-07T20:32:47.4304871Z D: int, 2025-05-07T20:32:47.4304976Z scale_ub: Optional[float], 2025-05-07T20:32:47.4305074Z contiguous: bool, 2025-05-07T20:32:47.4305162Z compiled: bool, 2025-05-07T20:32:47.4305244Z ) -> None: 2025-05-07T20:32:47.4305341Z torch.manual_seed(2025) 2025-05-07T20:32:47.4305414Z 2025-05-07T20:32:47.4305592Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4305669Z 2025-05-07T20:32:47.4305762Z x_sign = torch.sign(x) 2025-05-07T20:32:47.4305945Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:47.4307767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
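Note how the failure point drifts: the earliest examples die allocating x at line 92, while these later ones die at line 95 (torch.clamp) with only ~4 MiB free — allocations from previously failed examples are still referenced, so each new draw starts with a nearly full device. A hypothetical per-example teardown, not present in the test file, that would reset the allocator between draws:

    import gc

    import torch

    def release_cuda_memory() -> None:
        # Drop lingering references from a failed example, then return the
        # allocator's cached blocks to the driver for the next draw.
        gc.collect()
        torch.cuda.empty_cache()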
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4307836Z 2025-05-07T20:32:47.4307958Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:47.4307962Z 2025-05-07T20:32:47.4308068Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:47.4308302Z self=, 2025-05-07T20:32:47.4308421Z T=128, 2025-05-07T20:32:47.4308498Z D=7168, 2025-05-07T20:32:47.4308622Z scale_ub=None, 2025-05-07T20:32:47.4308710Z contiguous=True, 2025-05-07T20:32:47.4308794Z compiled=True, 2025-05-07T20:32:47.4308870Z ) 2025-05-07T20:32:47.4309097Z self = 2025-05-07T20:32:47.4309294Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:47.4309303Z 2025-05-07T20:32:47.4309389Z @given( 2025-05-07T20:32:47.4309530Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:47.4309637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:47.4309755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:47.4309875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:47.4309996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:47.4310071Z ) 2025-05-07T20:32:47.4310331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:47.4310432Z def test_silu_mul_quant( 2025-05-07T20:32:47.4310509Z self, 2025-05-07T20:32:47.4310587Z T: int, 2025-05-07T20:32:47.4310667Z D: int, 2025-05-07T20:32:47.4310768Z scale_ub: Optional[float], 2025-05-07T20:32:47.4310862Z contiguous: bool, 2025-05-07T20:32:47.4310950Z compiled: bool, 2025-05-07T20:32:47.4311029Z ) -> None: 2025-05-07T20:32:47.4311127Z torch.manual_seed(2025) 2025-05-07T20:32:47.4311201Z 2025-05-07T20:32:47.4311375Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:47.4313201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:47.4313210Z 2025-05-07T20:32:47.4313518Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:47.4313727Z =============================== warnings summary =============================== 2025-05-07T20:32:47.4314059Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4314379Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4314689Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:47.4315686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:47.4315935Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:47.4315940Z 2025-05-07T20:32:47.4316158Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:47.4316335Z ================= 1 failed, 1 deselected, 3 warnings in 12.03s ================= 2025-05-07T20:32:49.0335082Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:49.0949701Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:32:49.0949944Z 2025-05-07T20:32:51.0969800Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:53.2572544Z ============================= test session starts ============================== 2025-05-07T20:32:53.2573816Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:53.2574888Z cachedir: .pytest_cache 2025-05-07T20:32:53.2576079Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:53.2577589Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:53.2578432Z plugins: hypothesis-6.131.14 2025-05-07T20:32:54.8127502Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:54.9101688Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:54.9102121Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:54.9102354Z 2025-05-07T20:32:57.0258293Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.0258998Z self=, 2025-05-07T20:32:57.0259426Z T=1, 2025-05-07T20:32:57.0259627Z D=5120, 2025-05-07T20:32:57.0259835Z scale_ub=None, 2025-05-07T20:32:57.0260057Z contiguous=True, 2025-05-07T20:32:57.0260294Z compiled=True, 2025-05-07T20:32:57.0260516Z ) 2025-05-07T20:32:57.0260854Z self = 2025-05-07T20:32:57.0261373Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:57.0261693Z 2025-05-07T20:32:57.0261777Z @given( 2025-05-07T20:32:57.0262025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.0262362Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.0262691Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.0263048Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.0263403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.0263709Z ) 2025-05-07T20:32:57.0264079Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.0264550Z def test_silu_mul_quant( 2025-05-07T20:32:57.0264813Z self, 2025-05-07T20:32:57.0265042Z T: int, 2025-05-07T20:32:57.0265252Z D: int, 2025-05-07T20:32:57.0265489Z scale_ub: Optional[float], 2025-05-07T20:32:57.0265780Z contiguous: bool, 2025-05-07T20:32:57.0266034Z compiled: bool, 2025-05-07T20:32:57.0266276Z ) -> None: 2025-05-07T20:32:57.0266508Z torch.manual_seed(2025) 2025-05-07T20:32:57.0266775Z 2025-05-07T20:32:57.0267067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.0267436Z 2025-05-07T20:32:57.0267645Z x_sign = torch.sign(x) 2025-05-07T20:32:57.0267954Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:32:57.0268564Z x = x_sign * x_clamp 2025-05-07T20:32:57.0268831Z x0 = x[:, :D] 2025-05-07T20:32:57.0269057Z x1 = x[:, D:] 2025-05-07T20:32:57.0269279Z 2025-05-07T20:32:57.0269485Z if contiguous: 2025-05-07T20:32:57.0269725Z x0 = x0.contiguous() 2025-05-07T20:32:57.0270002Z x1 = x1.contiguous() 2025-05-07T20:32:57.0270261Z 2025-05-07T20:32:57.0270462Z if scale_ub is not None: 2025-05-07T20:32:57.0270753Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.0271111Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.0271522Z ) 2025-05-07T20:32:57.0271727Z else: 2025-05-07T20:32:57.0271951Z scale_ub_tensor = None 2025-05-07T20:32:57.0272213Z 2025-05-07T20:32:57.0272463Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0272802Z op = silu_mul_quant 2025-05-07T20:32:57.0273152Z if compiled: 2025-05-07T20:32:57.0273413Z op = torch.compile(op) 2025-05-07T20:32:57.0273806Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0274106Z 2025-05-07T20:32:57.0274310Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.0274617Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.0274929Z 2025-05-07T20:32:57.0275176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0275532Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.0275842Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.0276171Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.0276558Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.0276891Z 2025-05-07T20:32:57.0277111Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.0277319Z 2025-05-07T20:32:57.0277428Z moe/activation_test.py:126: 2025-05-07T20:32:57.0277752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0278116Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.0278463Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.0279306Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.0280186Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.0280780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.0281509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.0282250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.0283024Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.0283805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.0284492Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.0285139Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.0285694Z fn() 2025-05-07T20:32:57.0286231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.0286856Z self.fn.run( 2025-05-07T20:32:57.0287360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.0287924Z kernel = self.compile( 2025-05-07T20:32:57.0288505Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.0289207Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.0289695Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0289943Z 2025-05-07T20:32:57.0290169Z self = 2025-05-07T20:32:57.0291322Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.0292796Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7cae700>} 2025-05-07T20:32:57.0294263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.0295347Z context = 2025-05-07T20:32:57.0295695Z 2025-05-07T20:32:57.0295914Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.0296479Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.0296980Z module_map=module_map) 2025-05-07T20:32:57.0297369Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.0297750Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.0298038Z E ^ 2025-05-07T20:32:57.0298535Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.0299018Z 2025-05-07T20:32:57.0299461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.0300013Z 2025-05-07T20:32:57.0300126Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.0300575Z self=, 2025-05-07T20:32:57.0301007Z T=2048, 2025-05-07T20:32:57.0301206Z D=5120, 2025-05-07T20:32:57.0301416Z scale_ub=1200.0, 2025-05-07T20:32:57.0301654Z contiguous=True, 2025-05-07T20:32:57.0301885Z compiled=False, 2025-05-07T20:32:57.0302104Z ) 2025-05-07T20:32:57.0302443Z self = 2025-05-07T20:32:57.0302970Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:57.0303268Z 2025-05-07T20:32:57.0303350Z @given( 2025-05-07T20:32:57.0303597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.0303932Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.0304263Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.0304617Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.0304972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.0305275Z ) 2025-05-07T20:32:57.0305653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.0306127Z def test_silu_mul_quant( 2025-05-07T20:32:57.0306379Z self, 2025-05-07T20:32:57.0306588Z T: int, 2025-05-07T20:32:57.0306798Z D: int, 2025-05-07T20:32:57.0307026Z scale_ub: Optional[float], 2025-05-07T20:32:57.0307316Z contiguous: bool, 2025-05-07T20:32:57.0307573Z compiled: bool, 2025-05-07T20:32:57.0307810Z ) -> None: 2025-05-07T20:32:57.0308047Z torch.manual_seed(2025) 2025-05-07T20:32:57.0308311Z 2025-05-07T20:32:57.0308599Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.0308973Z 2025-05-07T20:32:57.0309185Z x_sign = torch.sign(x) 2025-05-07T20:32:57.0309494Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.0309831Z x = x_sign * x_clamp 2025-05-07T20:32:57.0310090Z x0 = x[:, :D] 
2025-05-07T20:32:57.0310328Z x1 = x[:, D:] 2025-05-07T20:32:57.0310547Z 2025-05-07T20:32:57.0310795Z if contiguous: 2025-05-07T20:32:57.0311042Z x0 = x0.contiguous() 2025-05-07T20:32:57.0311312Z x1 = x1.contiguous() 2025-05-07T20:32:57.0311566Z 2025-05-07T20:32:57.0311768Z if scale_ub is not None: 2025-05-07T20:32:57.0312054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.0312410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.0312741Z ) 2025-05-07T20:32:57.0312941Z else: 2025-05-07T20:32:57.0313164Z scale_ub_tensor = None 2025-05-07T20:32:57.0313725Z 2025-05-07T20:32:57.0313967Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.0314301Z op = silu_mul_quant 2025-05-07T20:32:57.0314571Z if compiled: 2025-05-07T20:32:57.0314828Z op = torch.compile(op) 2025-05-07T20:32:57.0315144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0315512Z 2025-05-07T20:32:57.0315719Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.0315955Z 2025-05-07T20:32:57.0316065Z moe/activation_test.py:117: 2025-05-07T20:32:57.0316383Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0316740Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.0317039Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.0317770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:57.0318501Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.0319074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.0319806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.0320600Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.0321175Z kernel = self.compile( 2025-05-07T20:32:57.0321778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.0322505Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.0322931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.0323179Z 2025-05-07T20:32:57.0323407Z self = 2025-05-07T20:32:57.0324548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.0326003Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7b62020>} 2025-05-07T20:32:57.0327440Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.0328523Z context = 2025-05-07T20:32:57.0328832Z 2025-05-07T20:32:57.0329019Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.0329575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.0330079Z module_map=module_map) 2025-05-07T20:32:57.0330471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.0330846Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.0331127Z E ^ 2025-05-07T20:32:57.0331624Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.0332105Z 2025-05-07T20:32:57.0332628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6885269Z 2025-05-07T20:32:57.6885753Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6886431Z self=, 2025-05-07T20:32:57.6886948Z T=2048, 2025-05-07T20:32:57.6887146Z D=5120, 2025-05-07T20:32:57.6887349Z scale_ub=1200.0, 2025-05-07T20:32:57.6887576Z contiguous=True, 2025-05-07T20:32:57.6887807Z compiled=True, 2025-05-07T20:32:57.6888324Z ) 2025-05-07T20:32:57.6888653Z self = 2025-05-07T20:32:57.6889169Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:57.6889456Z 2025-05-07T20:32:57.6889563Z @given( 2025-05-07T20:32:57.6889801Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6890261Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6890659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6891001Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6891344Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6891646Z ) 2025-05-07T20:32:57.6892004Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6892465Z def test_silu_mul_quant( 2025-05-07T20:32:57.6892719Z self, 2025-05-07T20:32:57.6892917Z T: int, 2025-05-07T20:32:57.6893121Z D: int, 2025-05-07T20:32:57.6893349Z scale_ub: Optional[float], 2025-05-07T20:32:57.6893626Z contiguous: bool, 2025-05-07T20:32:57.6893875Z compiled: bool, 2025-05-07T20:32:57.6894126Z ) -> None: 2025-05-07T20:32:57.6894350Z torch.manual_seed(2025) 2025-05-07T20:32:57.6900938Z 2025-05-07T20:32:57.6901238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6901615Z 2025-05-07T20:32:57.6901829Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6902143Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6902478Z x = x_sign * x_clamp 2025-05-07T20:32:57.6902726Z x0 = x[:, :D] 2025-05-07T20:32:57.6902955Z x1 = x[:, D:] 2025-05-07T20:32:57.6903176Z 2025-05-07T20:32:57.6903369Z if contiguous: 2025-05-07T20:32:57.6903619Z x0 = x0.contiguous() 2025-05-07T20:32:57.6903897Z x1 = x1.contiguous() 2025-05-07T20:32:57.6904142Z 2025-05-07T20:32:57.6904347Z if scale_ub is not None: 2025-05-07T20:32:57.6904640Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6904992Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6905321Z ) 2025-05-07T20:32:57.6905531Z else: 2025-05-07T20:32:57.6905749Z scale_ub_tensor = None 2025-05-07T20:32:57.6906022Z 2025-05-07T20:32:57.6906273Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6906604Z op = silu_mul_quant 2025-05-07T20:32:57.6906871Z if compiled: 2025-05-07T20:32:57.6907134Z op = torch.compile(op) 2025-05-07T20:32:57.6907448Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6907733Z 2025-05-07T20:32:57.6907938Z y_fp8, y_scale = fn() 2025-05-07T20:32:57.6908239Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:57.6908545Z 2025-05-07T20:32:57.6908797Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6909158Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:57.6909461Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:57.6909792Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:57.6910173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.6910495Z 2025-05-07T20:32:57.6910710Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:57.6911048Z 2025-05-07T20:32:57.6911169Z moe/activation_test.py:126: 2025-05-07T20:32:57.6911482Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6911834Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:57.6912229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:57.6913057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:57.6914197Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:57.6914873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6915594Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6916318Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:57.6917197Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:57.6917971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:57.6918645Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:57.6919285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:57.6919822Z fn() 2025-05-07T20:32:57.6920433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:57.6921046Z self.fn.run( 2025-05-07T20:32:57.6921528Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6922088Z kernel = self.compile( 2025-05-07T20:32:57.6922660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6923356Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6923769Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6924018Z 2025-05-07T20:32:57.6924239Z self = 2025-05-07T20:32:57.6925382Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6926828Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6c44400>} 2025-05-07T20:32:57.6928224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6929303Z context = 2025-05-07T20:32:57.6929616Z 2025-05-07T20:32:57.6929791Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6930339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6930826Z module_map=module_map) 2025-05-07T20:32:57.6931213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6931595Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:57.6931869Z E ^ 2025-05-07T20:32:57.6932360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6932835Z 2025-05-07T20:32:57.6933269Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:57.6933872Z 2025-05-07T20:32:57.6933994Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:57.6934429Z self=, 2025-05-07T20:32:57.6934854Z T=16384, 2025-05-07T20:32:57.6935059Z D=7168, 2025-05-07T20:32:57.6935256Z scale_ub=1200.0, 2025-05-07T20:32:57.6935489Z contiguous=False, 2025-05-07T20:32:57.6935730Z compiled=False, 2025-05-07T20:32:57.6935943Z ) 2025-05-07T20:32:57.6936282Z self = 2025-05-07T20:32:57.6936810Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:57.6937152Z 2025-05-07T20:32:57.6937240Z @given( 2025-05-07T20:32:57.6937480Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:57.6937816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:57.6938142Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:57.6938531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:57.6938918Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:57.6939232Z ) 2025-05-07T20:32:57.6939596Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:57.6940059Z def test_silu_mul_quant( 2025-05-07T20:32:57.6940313Z self, 2025-05-07T20:32:57.6940518Z T: int, 2025-05-07T20:32:57.6940719Z D: int, 2025-05-07T20:32:57.6940948Z scale_ub: Optional[float], 2025-05-07T20:32:57.6941234Z contiguous: bool, 2025-05-07T20:32:57.6941477Z compiled: bool, 2025-05-07T20:32:57.6941715Z ) -> None: 2025-05-07T20:32:57.6941942Z torch.manual_seed(2025) 2025-05-07T20:32:57.6942186Z 2025-05-07T20:32:57.6942473Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:57.6942832Z 2025-05-07T20:32:57.6943029Z x_sign = torch.sign(x) 2025-05-07T20:32:57.6943338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:57.6943668Z x = x_sign * x_clamp 2025-05-07T20:32:57.6943915Z x0 = x[:, :D] 2025-05-07T20:32:57.6944146Z x1 = x[:, D:] 2025-05-07T20:32:57.6944366Z 2025-05-07T20:32:57.6944556Z if contiguous: 2025-05-07T20:32:57.6944805Z x0 = x0.contiguous() 2025-05-07T20:32:57.6945083Z x1 = x1.contiguous() 2025-05-07T20:32:57.6945336Z 2025-05-07T20:32:57.6945534Z if scale_ub is not None: 2025-05-07T20:32:57.6945822Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:57.6946180Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:57.6946505Z ) 2025-05-07T20:32:57.6946714Z else: 2025-05-07T20:32:57.6946943Z scale_ub_tensor = None 2025-05-07T20:32:57.6947201Z 2025-05-07T20:32:57.6947448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:57.6947783Z op = silu_mul_quant 2025-05-07T20:32:57.6948040Z if compiled: 2025-05-07T20:32:57.6948306Z op = torch.compile(op) 2025-05-07T20:32:57.6948618Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6948901Z 2025-05-07T20:32:57.6949112Z > y_fp8, y_scale = fn() 2025-05-07T20:32:57.6949283Z 2025-05-07T20:32:57.6949398Z moe/activation_test.py:117: 2025-05-07T20:32:57.6949709Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6950053Z moe/activation_test.py:115: in fn 2025-05-07T20:32:57.6950353Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:57.6951072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:57.6951781Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:57.6952344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:57.6953111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:57.6953808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:57.6954360Z kernel = self.compile( 2025-05-07T20:32:57.6954929Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:57.6955624Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:57.6956032Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:57.6956321Z 2025-05-07T20:32:57.6956536Z self = 2025-05-07T20:32:57.6957655Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:57.6959166Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6a271a0>} 2025-05-07T20:32:57.6960628Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:57.6961689Z context = 2025-05-07T20:32:57.6961998Z 2025-05-07T20:32:57.6962172Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:57.6962721Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:57.6963210Z module_map=module_map) 2025-05-07T20:32:57.6963586Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:57.6963960Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:57.6964233Z E ^ 2025-05-07T20:32:57.6964719Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.6965192Z 2025-05-07T20:32:57.6965625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3892432Z 2025-05-07T20:32:58.3892933Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3893422Z self=, 2025-05-07T20:32:58.3893839Z T=1, 2025-05-07T20:32:58.3894057Z D=7168, 2025-05-07T20:32:58.3894260Z scale_ub=None, 2025-05-07T20:32:58.3894480Z contiguous=True, 2025-05-07T20:32:58.3894714Z compiled=True, 2025-05-07T20:32:58.3894930Z ) 2025-05-07T20:32:58.3895264Z self = 2025-05-07T20:32:58.3895763Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:58.3896047Z 2025-05-07T20:32:58.3896133Z @given( 2025-05-07T20:32:58.3896381Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3896702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3897023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3897369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3897704Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3898001Z ) 2025-05-07T20:32:58.3898362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3898823Z def test_silu_mul_quant( 2025-05-07T20:32:58.3899066Z self, 2025-05-07T20:32:58.3899269Z T: int, 2025-05-07T20:32:58.3899474Z D: int, 2025-05-07T20:32:58.3899694Z scale_ub: Optional[float], 2025-05-07T20:32:58.3899977Z contiguous: bool, 2025-05-07T20:32:58.3900227Z compiled: bool, 2025-05-07T20:32:58.3900462Z ) -> None: 2025-05-07T20:32:58.3900973Z torch.manual_seed(2025) 2025-05-07T20:32:58.3901235Z 2025-05-07T20:32:58.3901513Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3901873Z 2025-05-07T20:32:58.3902087Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3902393Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3902710Z x = x_sign * x_clamp 2025-05-07T20:32:58.3902961Z x0 = x[:, :D] 2025-05-07T20:32:58.3903191Z x1 = x[:, D:] 2025-05-07T20:32:58.3903401Z 2025-05-07T20:32:58.3903600Z if contiguous: 2025-05-07T20:32:58.3903923Z x0 = x0.contiguous() 2025-05-07T20:32:58.3904195Z x1 = x1.contiguous() 2025-05-07T20:32:58.3904441Z 2025-05-07T20:32:58.3904643Z if scale_ub is not None: 2025-05-07T20:32:58.3904929Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3905277Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3905682Z ) 2025-05-07T20:32:58.3905892Z else: 2025-05-07T20:32:58.3906176Z scale_ub_tensor = None 2025-05-07T20:32:58.3906438Z 2025-05-07T20:32:58.3906683Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3907005Z op = silu_mul_quant 2025-05-07T20:32:58.3907265Z if compiled: 2025-05-07T20:32:58.3907547Z op = torch.compile(op) 2025-05-07T20:32:58.3907854Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3908139Z 2025-05-07T20:32:58.3908333Z y_fp8, y_scale = fn() 2025-05-07T20:32:58.3908629Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:58.3908933Z 2025-05-07T20:32:58.3909176Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3909522Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:58.3909826Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:58.3910148Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:58.3910527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3910850Z 2025-05-07T20:32:58.3911061Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:58.3911262Z 2025-05-07T20:32:58.3911367Z moe/activation_test.py:126: 2025-05-07T20:32:58.3911679Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3912035Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:58.3912370Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:58.3913194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:58.3914259Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:58.3914827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3915539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3916253Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:58.3917003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:58.3917765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:58.3918426Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:58.3919054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:58.3919592Z fn() 2025-05-07T20:32:58.3920217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:58.3920824Z self.fn.run( 2025-05-07T20:32:58.3921311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3921950Z kernel = self.compile( 2025-05-07T20:32:58.3922511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3923190Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3923610Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3923848Z 2025-05-07T20:32:58.3924064Z self = 2025-05-07T20:32:58.3925191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3926689Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6d00860>} 2025-05-07T20:32:58.3928214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3929271Z context = 2025-05-07T20:32:58.3929571Z 2025-05-07T20:32:58.3929746Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3930295Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3930785Z module_map=module_map) 2025-05-07T20:32:58.3931172Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3931540Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:58.3931817Z E ^ 2025-05-07T20:32:58.3932298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3932770Z 2025-05-07T20:32:58.3933204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:58.3933739Z 2025-05-07T20:32:58.3933846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:58.3934277Z self=, 2025-05-07T20:32:58.3934694Z T=4096, 2025-05-07T20:32:58.3934883Z D=5120, 2025-05-07T20:32:58.3935082Z scale_ub=None, 2025-05-07T20:32:58.3935307Z contiguous=False, 2025-05-07T20:32:58.3935536Z compiled=False, 2025-05-07T20:32:58.3935752Z ) 2025-05-07T20:32:58.3936085Z self = 2025-05-07T20:32:58.3936596Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:58.3936886Z 2025-05-07T20:32:58.3936965Z @given( 2025-05-07T20:32:58.3937208Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:58.3937533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:58.3937859Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:58.3938205Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:58.3938550Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:58.3938842Z ) 2025-05-07T20:32:58.3939203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:58.3939662Z def test_silu_mul_quant( 2025-05-07T20:32:58.3939906Z self, 2025-05-07T20:32:58.3940107Z T: int, 2025-05-07T20:32:58.3940313Z D: int, 2025-05-07T20:32:58.3940536Z scale_ub: Optional[float], 2025-05-07T20:32:58.3940819Z contiguous: bool, 2025-05-07T20:32:58.3941069Z compiled: bool, 2025-05-07T20:32:58.3941295Z ) -> None: 2025-05-07T20:32:58.3941520Z torch.manual_seed(2025) 2025-05-07T20:32:58.3941777Z 2025-05-07T20:32:58.3942140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:58.3942523Z 2025-05-07T20:32:58.3942730Z x_sign = torch.sign(x) 2025-05-07T20:32:58.3943035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:58.3943351Z x = x_sign * x_clamp 2025-05-07T20:32:58.3943601Z x0 = x[:, :D] 2025-05-07T20:32:58.3943825Z x1 = x[:, D:] 2025-05-07T20:32:58.3944034Z 2025-05-07T20:32:58.3944227Z if contiguous: 2025-05-07T20:32:58.3944470Z x0 = x0.contiguous() 2025-05-07T20:32:58.3944732Z x1 = x1.contiguous() 2025-05-07T20:32:58.3944983Z 2025-05-07T20:32:58.3945231Z if scale_ub is not None: 2025-05-07T20:32:58.3945510Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:58.3945862Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:58.3946185Z ) 2025-05-07T20:32:58.3946378Z else: 2025-05-07T20:32:58.3946639Z scale_ub_tensor = None 2025-05-07T20:32:58.3946900Z 2025-05-07T20:32:58.3947177Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:58.3947508Z op = silu_mul_quant 2025-05-07T20:32:58.3947771Z if compiled: 2025-05-07T20:32:58.3948022Z op = torch.compile(op) 2025-05-07T20:32:58.3948333Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3948618Z 2025-05-07T20:32:58.3948819Z > y_fp8, y_scale = fn() 2025-05-07T20:32:58.3948989Z 2025-05-07T20:32:58.3949091Z moe/activation_test.py:117: 2025-05-07T20:32:58.3949395Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3949742Z moe/activation_test.py:115: in fn 2025-05-07T20:32:58.3950029Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:58.3950743Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:58.3951461Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:58.3952027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:58.3952730Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:58.3953418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:58.3953969Z kernel = self.compile( 2025-05-07T20:32:58.3954529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:58.3955210Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:58.3955627Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:58.3955863Z 2025-05-07T20:32:58.3956083Z self = 2025-05-07T20:32:58.3957200Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:58.3958629Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6118cc0>} 2025-05-07T20:32:58.3960016Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:58.3961170Z context = 2025-05-07T20:32:58.3961468Z 2025-05-07T20:32:58.3961648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:58.3962211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:58.3962727Z module_map=module_map) 2025-05-07T20:32:58.3963160Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:58.3963526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:58.3963802Z E ^ 2025-05-07T20:32:58.3964283Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:58.3964750Z 2025-05-07T20:32:58.3965186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.1024343Z 2025-05-07T20:32:59.1024694Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.1025510Z self=, 2025-05-07T20:32:59.1026049Z T=4096, 2025-05-07T20:32:59.1026329Z D=7168, 2025-05-07T20:32:59.1026594Z scale_ub=None, 2025-05-07T20:32:59.1026884Z contiguous=False, 2025-05-07T20:32:59.1027191Z compiled=False, 2025-05-07T20:32:59.1027621Z ) 2025-05-07T20:32:59.1028048Z self = 2025-05-07T20:32:59.1028581Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:59.1028873Z 2025-05-07T20:32:59.1028957Z @given( 2025-05-07T20:32:59.1029200Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.1029528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.1029853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.1030200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.1030544Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.1030850Z ) 2025-05-07T20:32:59.1031222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.1031694Z def test_silu_mul_quant( 2025-05-07T20:32:59.1031947Z self, 2025-05-07T20:32:59.1032159Z T: int, 2025-05-07T20:32:59.1032402Z D: int, 2025-05-07T20:32:59.1032654Z scale_ub: Optional[float], 2025-05-07T20:32:59.1032947Z contiguous: bool, 2025-05-07T20:32:59.1033205Z compiled: bool, 2025-05-07T20:32:59.1033442Z ) -> None: 2025-05-07T20:32:59.1033673Z torch.manual_seed(2025) 2025-05-07T20:32:59.1033930Z 2025-05-07T20:32:59.1034251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.1034615Z 2025-05-07T20:32:59.1034820Z x_sign = torch.sign(x) 2025-05-07T20:32:59.1035131Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.1035463Z x = x_sign * x_clamp 2025-05-07T20:32:59.1035713Z x0 = x[:, :D] 2025-05-07T20:32:59.1035950Z x1 = x[:, D:] 2025-05-07T20:32:59.1036166Z 2025-05-07T20:32:59.1036377Z if contiguous: 2025-05-07T20:32:59.1043241Z x0 = x0.contiguous() 2025-05-07T20:32:59.1043544Z x1 = x1.contiguous() 2025-05-07T20:32:59.1043814Z 2025-05-07T20:32:59.1044027Z if scale_ub is not None: 2025-05-07T20:32:59.1044315Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.1044679Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.1045015Z ) 2025-05-07T20:32:59.1045220Z else: 2025-05-07T20:32:59.1045437Z scale_ub_tensor = None 2025-05-07T20:32:59.1045706Z 2025-05-07T20:32:59.1045957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1046291Z op = silu_mul_quant 2025-05-07T20:32:59.1046561Z if compiled: 2025-05-07T20:32:59.1046825Z op = torch.compile(op) 2025-05-07T20:32:59.1047136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1047430Z 2025-05-07T20:32:59.1047640Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.1047816Z 2025-05-07T20:32:59.1047925Z moe/activation_test.py:117: 2025-05-07T20:32:59.1048243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1048602Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.1049034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1049757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.1050488Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.1051054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1051764Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1052460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1053072Z kernel = self.compile( 2025-05-07T20:32:59.1053639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1054324Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1054830Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1055070Z 2025-05-07T20:32:59.1055297Z self = 2025-05-07T20:32:59.1056421Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1057850Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a6119260>} 2025-05-07T20:32:59.1059246Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1060315Z context = 2025-05-07T20:32:59.1060617Z 2025-05-07T20:32:59.1060801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1061339Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1061827Z module_map=module_map) 2025-05-07T20:32:59.1062215Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1062587Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.1062856Z E ^ 2025-05-07T20:32:59.1063349Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1063822Z 2025-05-07T20:32:59.1064262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.1064795Z 2025-05-07T20:32:59.1064909Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.1065342Z self=, 2025-05-07T20:32:59.1065767Z T=128, 2025-05-07T20:32:59.1065969Z D=7168, 2025-05-07T20:32:59.1066166Z scale_ub=None, 2025-05-07T20:32:59.1066395Z contiguous=False, 2025-05-07T20:32:59.1066633Z compiled=True, 2025-05-07T20:32:59.1066841Z ) 2025-05-07T20:32:59.1067178Z self = 2025-05-07T20:32:59.1067702Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:59.1067979Z 2025-05-07T20:32:59.1068059Z @given( 2025-05-07T20:32:59.1068308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.1068638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.1068968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.1069309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.1069655Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.1069960Z ) 2025-05-07T20:32:59.1070370Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.1070837Z def test_silu_mul_quant( 2025-05-07T20:32:59.1071096Z self, 2025-05-07T20:32:59.1071298Z T: int, 2025-05-07T20:32:59.1071508Z D: int, 2025-05-07T20:32:59.1071742Z scale_ub: Optional[float], 2025-05-07T20:32:59.1072027Z contiguous: bool, 2025-05-07T20:32:59.1072284Z compiled: bool, 2025-05-07T20:32:59.1072520Z ) -> None: 2025-05-07T20:32:59.1072740Z torch.manual_seed(2025) 2025-05-07T20:32:59.1073040Z 2025-05-07T20:32:59.1073330Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.1073680Z 2025-05-07T20:32:59.1073889Z x_sign = torch.sign(x) 2025-05-07T20:32:59.1074195Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.1074520Z x = x_sign * x_clamp 2025-05-07T20:32:59.1074809Z x0 = x[:, :D] 2025-05-07T20:32:59.1075043Z x1 = x[:, D:] 2025-05-07T20:32:59.1075263Z 2025-05-07T20:32:59.1075496Z if contiguous: 2025-05-07T20:32:59.1075746Z x0 = x0.contiguous() 2025-05-07T20:32:59.1076022Z x1 = x1.contiguous() 2025-05-07T20:32:59.1076268Z 2025-05-07T20:32:59.1076469Z if scale_ub is not None: 2025-05-07T20:32:59.1076755Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.1077104Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.1077430Z ) 2025-05-07T20:32:59.1077630Z else: 2025-05-07T20:32:59.1077846Z scale_ub_tensor = None 2025-05-07T20:32:59.1078107Z 2025-05-07T20:32:59.1078348Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1078668Z op = silu_mul_quant 2025-05-07T20:32:59.1078930Z if compiled: 2025-05-07T20:32:59.1079191Z op = torch.compile(op) 2025-05-07T20:32:59.1079510Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.1079796Z 2025-05-07T20:32:59.1080002Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.1080374Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.1080671Z 2025-05-07T20:32:59.1080922Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.1081273Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.1081574Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.1081904Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.1082287Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.1082610Z 2025-05-07T20:32:59.1082830Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.1083037Z 2025-05-07T20:32:59.1083151Z moe/activation_test.py:126: 2025-05-07T20:32:59.1083467Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1083824Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.1084173Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.1084995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.1085772Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.1086343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1087055Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1087772Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.1088523Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.1089295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.1090016Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.1090651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.1091183Z fn() 2025-05-07T20:32:59.1091713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.1092343Z self.fn.run( 2025-05-07T20:32:59.1092854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1093409Z kernel = self.compile( 2025-05-07T20:32:59.1094024Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1094707Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1095118Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1095402Z 2025-05-07T20:32:59.1095623Z self = 2025-05-07T20:32:59.1096787Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1098217Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a611b420>} 2025-05-07T20:32:59.1099602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1100665Z context = 2025-05-07T20:32:59.1100970Z 2025-05-07T20:32:59.1101144Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1101698Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1102178Z module_map=module_map) 2025-05-07T20:32:59.1102564Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1102941Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.1103217Z E ^ 2025-05-07T20:32:59.1103698Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1104175Z 2025-05-07T20:32:59.1104605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3488959Z 2025-05-07T20:32:59.3489483Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3490068Z self=, 2025-05-07T20:32:59.3490627Z T=128, 2025-05-07T20:32:59.3490836Z D=7168, 2025-05-07T20:32:59.3491054Z scale_ub=None, 2025-05-07T20:32:59.3491289Z contiguous=False, 2025-05-07T20:32:59.3491534Z compiled=False, 2025-05-07T20:32:59.3491758Z ) 2025-05-07T20:32:59.3492102Z self = 2025-05-07T20:32:59.3492618Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:59.3492907Z 2025-05-07T20:32:59.3492989Z @given( 2025-05-07T20:32:59.3493233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3493559Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3493888Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3494242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3494591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3494886Z ) 2025-05-07T20:32:59.3495255Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3495725Z def test_silu_mul_quant( 2025-05-07T20:32:59.3496281Z self, 2025-05-07T20:32:59.3496497Z T: int, 2025-05-07T20:32:59.3496707Z D: int, 2025-05-07T20:32:59.3496932Z scale_ub: Optional[float], 2025-05-07T20:32:59.3497222Z contiguous: bool, 2025-05-07T20:32:59.3497476Z compiled: bool, 2025-05-07T20:32:59.3497708Z ) -> None: 2025-05-07T20:32:59.3497934Z torch.manual_seed(2025) 2025-05-07T20:32:59.3498190Z 2025-05-07T20:32:59.3498471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3498828Z 2025-05-07T20:32:59.3499135Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3499435Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3499760Z x = x_sign * x_clamp 2025-05-07T20:32:59.3500012Z x0 = x[:, :D] 2025-05-07T20:32:59.3500239Z x1 = x[:, D:] 2025-05-07T20:32:59.3500455Z 2025-05-07T20:32:59.3500738Z if contiguous: 2025-05-07T20:32:59.3500980Z x0 = x0.contiguous() 2025-05-07T20:32:59.3501342Z x1 = x1.contiguous() 2025-05-07T20:32:59.3501598Z 2025-05-07T20:32:59.3501798Z if scale_ub is not None: 2025-05-07T20:32:59.3502102Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3502452Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3502777Z ) 2025-05-07T20:32:59.3502981Z else: 2025-05-07T20:32:59.3503200Z scale_ub_tensor = None 2025-05-07T20:32:59.3503465Z 2025-05-07T20:32:59.3503712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3504042Z op = silu_mul_quant 2025-05-07T20:32:59.3504306Z if compiled: 2025-05-07T20:32:59.3504569Z op = torch.compile(op) 2025-05-07T20:32:59.3504888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3505174Z 2025-05-07T20:32:59.3505378Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.3505556Z 2025-05-07T20:32:59.3505673Z moe/activation_test.py:117: 2025-05-07T20:32:59.3505985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3506336Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.3506632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3507357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3508083Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3508654Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3509379Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3510073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3510634Z kernel = self.compile( 2025-05-07T20:32:59.3511208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3511894Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3512319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3512608Z 2025-05-07T20:32:59.3512836Z self = 2025-05-07T20:32:59.3514444Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3515915Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c60a40>} 2025-05-07T20:32:59.3517400Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3518475Z context = 2025-05-07T20:32:59.3518782Z 2025-05-07T20:32:59.3518958Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3519509Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3520001Z module_map=module_map) 2025-05-07T20:32:59.3520496Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3520958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3521230Z E ^ 2025-05-07T20:32:59.3521718Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3522269Z 2025-05-07T20:32:59.3522818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3523570Z 2025-05-07T20:32:59.3523777Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3524267Z self=, 2025-05-07T20:32:59.3524695Z T=4096, 2025-05-07T20:32:59.3524895Z D=5120, 2025-05-07T20:32:59.3525096Z scale_ub=1200.0, 2025-05-07T20:32:59.3525338Z contiguous=True, 2025-05-07T20:32:59.3525576Z compiled=False, 2025-05-07T20:32:59.3525786Z ) 2025-05-07T20:32:59.3526126Z self = 2025-05-07T20:32:59.3526664Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:59.3526951Z 2025-05-07T20:32:59.3527043Z @given( 2025-05-07T20:32:59.3527282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3527618Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3527956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3528301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3528653Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3528959Z ) 2025-05-07T20:32:59.3529323Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3529793Z def test_silu_mul_quant( 2025-05-07T20:32:59.3530051Z self, 2025-05-07T20:32:59.3530256Z T: int, 2025-05-07T20:32:59.3530474Z D: int, 2025-05-07T20:32:59.3530715Z scale_ub: Optional[float], 2025-05-07T20:32:59.3531008Z contiguous: bool, 2025-05-07T20:32:59.3531264Z compiled: bool, 2025-05-07T20:32:59.3531505Z ) -> None: 2025-05-07T20:32:59.3531735Z torch.manual_seed(2025) 2025-05-07T20:32:59.3531989Z 2025-05-07T20:32:59.3532279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3532646Z 2025-05-07T20:32:59.3532852Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3533174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3533506Z x = x_sign * x_clamp 2025-05-07T20:32:59.3533760Z x0 = x[:, :D] 2025-05-07T20:32:59.3533996Z x1 = x[:, D:] 2025-05-07T20:32:59.3534218Z 2025-05-07T20:32:59.3534412Z if contiguous: 2025-05-07T20:32:59.3534658Z x0 = x0.contiguous() 2025-05-07T20:32:59.3534937Z x1 = x1.contiguous() 2025-05-07T20:32:59.3535194Z 2025-05-07T20:32:59.3535400Z if scale_ub is not None: 2025-05-07T20:32:59.3535691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3536053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3536381Z ) 2025-05-07T20:32:59.3536588Z else: 2025-05-07T20:32:59.3536812Z scale_ub_tensor = None 2025-05-07T20:32:59.3537074Z 2025-05-07T20:32:59.3537327Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3537666Z op = silu_mul_quant 2025-05-07T20:32:59.3537980Z if compiled: 2025-05-07T20:32:59.3538251Z op = torch.compile(op) 2025-05-07T20:32:59.3538570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3538859Z 2025-05-07T20:32:59.3539072Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.3539247Z 2025-05-07T20:32:59.3539358Z moe/activation_test.py:117: 2025-05-07T20:32:59.3539671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3540028Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.3540330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3541106Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.3541829Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.3542426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3543289Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3543991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3544558Z kernel = self.compile( 2025-05-07T20:32:59.3545138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3545836Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3546258Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3546511Z 2025-05-07T20:32:59.3546730Z self = 2025-05-07T20:32:59.3547872Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3549322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c60ea0>} 2025-05-07T20:32:59.3550727Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3551790Z context = 2025-05-07T20:32:59.3552098Z 2025-05-07T20:32:59.3552277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3552826Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3553314Z module_map=module_map) 2025-05-07T20:32:59.3553697Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3554072Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.3554349Z E ^ 2025-05-07T20:32:59.3554833Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3555313Z 2025-05-07T20:32:59.3555750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.3556286Z 2025-05-07T20:32:59.3556401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.3556840Z self=, 2025-05-07T20:32:59.3557260Z T=1, 2025-05-07T20:32:59.3557456Z D=5120, 2025-05-07T20:32:59.3557661Z scale_ub=None, 2025-05-07T20:32:59.3557883Z contiguous=True, 2025-05-07T20:32:59.3558118Z compiled=True, 2025-05-07T20:32:59.3558336Z ) 2025-05-07T20:32:59.3558670Z self = 2025-05-07T20:32:59.3559230Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:59.3559502Z 2025-05-07T20:32:59.3559593Z @given( 2025-05-07T20:32:59.3559831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.3560284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.3560609Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.3560962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.3561305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.3561606Z ) 2025-05-07T20:32:59.3561976Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.3562484Z def test_silu_mul_quant( 2025-05-07T20:32:59.3562743Z self, 2025-05-07T20:32:59.3562951Z T: int, 2025-05-07T20:32:59.3563153Z D: int, 2025-05-07T20:32:59.3563384Z scale_ub: Optional[float], 2025-05-07T20:32:59.3563675Z contiguous: bool, 2025-05-07T20:32:59.3563969Z compiled: bool, 2025-05-07T20:32:59.3564206Z ) -> None: 2025-05-07T20:32:59.3564477Z torch.manual_seed(2025) 2025-05-07T20:32:59.3564729Z 2025-05-07T20:32:59.3565017Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.3565378Z 2025-05-07T20:32:59.3565578Z x_sign = torch.sign(x) 2025-05-07T20:32:59.3565889Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.3566218Z x = x_sign * x_clamp 2025-05-07T20:32:59.3566478Z x0 = x[:, :D] 2025-05-07T20:32:59.3566703Z x1 = x[:, D:] 2025-05-07T20:32:59.3566924Z 2025-05-07T20:32:59.3567124Z if contiguous: 2025-05-07T20:32:59.3567363Z x0 = x0.contiguous() 2025-05-07T20:32:59.3567637Z x1 = x1.contiguous() 2025-05-07T20:32:59.3567894Z 2025-05-07T20:32:59.3568095Z if scale_ub is not None: 2025-05-07T20:32:59.3568384Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.3568739Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.3569064Z ) 2025-05-07T20:32:59.3569271Z else: 2025-05-07T20:32:59.3569494Z scale_ub_tensor = None 2025-05-07T20:32:59.3569753Z 2025-05-07T20:32:59.3569998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3570338Z op = silu_mul_quant 2025-05-07T20:32:59.3570602Z if compiled: 2025-05-07T20:32:59.3570863Z op = torch.compile(op) 2025-05-07T20:32:59.3571179Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.3571472Z 2025-05-07T20:32:59.3571669Z y_fp8, y_scale = fn() 2025-05-07T20:32:59.3571976Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:59.3572290Z 2025-05-07T20:32:59.3572535Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.3572889Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:59.3573203Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:59.3573534Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:59.3573915Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3574243Z 2025-05-07T20:32:59.3574450Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:59.3574660Z 2025-05-07T20:32:59.3574765Z moe/activation_test.py:126: 2025-05-07T20:32:59.3575076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3575432Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:59.3575776Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:59.3576605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:59.3577390Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:59.3577955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.3586111Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.3586867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:59.3587637Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:59.3588421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:59.3589104Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:59.3589731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:59.3590331Z fn() 2025-05-07T20:32:59.3590870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:59.3591475Z self.fn.run( 2025-05-07T20:32:59.3592023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.3592625Z kernel = self.compile( 2025-05-07T20:32:59.3593198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.3593876Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.3594301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.3594544Z 2025-05-07T20:32:59.3594770Z self = 2025-05-07T20:32:59.3595898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.3597324Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c62c00>} 2025-05-07T20:32:59.3598721Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.3599789Z context = 2025-05-07T20:32:59.3600188Z 2025-05-07T20:32:59.3600368Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.3600910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.3601401Z module_map=module_map) 2025-05-07T20:32:59.3601788Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.3602163Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:59.3602434Z E ^ 2025-05-07T20:32:59.3602918Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.3603386Z 2025-05-07T20:32:59.3603827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0568056Z 2025-05-07T20:33:00.0568396Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0569016Z self=, 2025-05-07T20:33:00.0569577Z T=2048, 2025-05-07T20:33:00.0569841Z D=5120, 2025-05-07T20:33:00.0570110Z scale_ub=None, 2025-05-07T20:33:00.0570400Z contiguous=True, 2025-05-07T20:33:00.0570697Z compiled=True, 2025-05-07T20:33:00.0570923Z ) 2025-05-07T20:33:00.0571259Z self = 2025-05-07T20:33:00.0571790Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0572082Z 2025-05-07T20:33:00.0572172Z @given( 2025-05-07T20:33:00.0572425Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0573078Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0573419Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0573776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0574123Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0574430Z ) 2025-05-07T20:33:00.0574808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0575273Z def test_silu_mul_quant( 2025-05-07T20:33:00.0575535Z self, 2025-05-07T20:33:00.0575840Z T: int, 2025-05-07T20:33:00.0576087Z D: int, 2025-05-07T20:33:00.0576323Z scale_ub: Optional[float], 2025-05-07T20:33:00.0576610Z contiguous: bool, 2025-05-07T20:33:00.0576872Z compiled: bool, 2025-05-07T20:33:00.0577125Z ) -> None: 2025-05-07T20:33:00.0577356Z torch.manual_seed(2025) 2025-05-07T20:33:00.0577702Z 2025-05-07T20:33:00.0577999Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0578436Z 2025-05-07T20:33:00.0578649Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0578963Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0579300Z x = x_sign * x_clamp 2025-05-07T20:33:00.0579550Z x0 = x[:, :D] 2025-05-07T20:33:00.0579786Z x1 = x[:, D:] 2025-05-07T20:33:00.0580007Z 2025-05-07T20:33:00.0580202Z if contiguous: 2025-05-07T20:33:00.0580451Z x0 = x0.contiguous() 2025-05-07T20:33:00.0580732Z x1 = x1.contiguous() 2025-05-07T20:33:00.0580989Z 2025-05-07T20:33:00.0581200Z if scale_ub is not None: 2025-05-07T20:33:00.0581495Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0581848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0582184Z ) 2025-05-07T20:33:00.0582401Z else: 2025-05-07T20:33:00.0582631Z scale_ub_tensor = None 2025-05-07T20:33:00.0582906Z 2025-05-07T20:33:00.0583163Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0583494Z op = silu_mul_quant 2025-05-07T20:33:00.0583768Z if compiled: 2025-05-07T20:33:00.0584036Z op = torch.compile(op) 2025-05-07T20:33:00.0584347Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0584646Z 2025-05-07T20:33:00.0584859Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0585174Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0585481Z 2025-05-07T20:33:00.0585740Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0586099Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0586409Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0586746Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0587135Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0587470Z 2025-05-07T20:33:00.0587698Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.0587904Z 2025-05-07T20:33:00.0588022Z moe/activation_test.py:126: 2025-05-07T20:33:00.0588342Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0588699Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0589052Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0589890Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0590680Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0591262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0591987Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0592778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0593541Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0594313Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0594993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0595635Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0596179Z fn() 2025-05-07T20:33:00.0596764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0597380Z self.fn.run( 2025-05-07T20:33:00.0597871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0598477Z kernel = self.compile( 2025-05-07T20:33:00.0599092Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0599785Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0600354Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0600602Z 2025-05-07T20:33:00.0600822Z self = 2025-05-07T20:33:00.0601956Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0603406Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781c7d1c0>} 2025-05-07T20:33:00.0604803Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0605873Z context = 2025-05-07T20:33:00.0606187Z 2025-05-07T20:33:00.0606366Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0606919Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0607410Z module_map=module_map) 2025-05-07T20:33:00.0607803Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0608191Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0608473Z E ^ 2025-05-07T20:33:00.0608966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0609444Z 2025-05-07T20:33:00.0609885Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0610426Z 2025-05-07T20:33:00.0610546Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0610983Z self=, 2025-05-07T20:33:00.0611410Z T=128, 2025-05-07T20:33:00.0611617Z D=5120, 2025-05-07T20:33:00.0611821Z scale_ub=None, 2025-05-07T20:33:00.0612054Z contiguous=True, 2025-05-07T20:33:00.0612297Z compiled=True, 2025-05-07T20:33:00.0612516Z ) 2025-05-07T20:33:00.0612855Z self = 2025-05-07T20:33:00.0613706Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.0614088Z 2025-05-07T20:33:00.0614185Z @given( 2025-05-07T20:33:00.0614428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0614765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0615108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0615562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0615933Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0616246Z ) 2025-05-07T20:33:00.0616617Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0617088Z def test_silu_mul_quant( 2025-05-07T20:33:00.0617355Z self, 2025-05-07T20:33:00.0617568Z T: int, 2025-05-07T20:33:00.0617777Z D: int, 2025-05-07T20:33:00.0618017Z scale_ub: Optional[float], 2025-05-07T20:33:00.0618315Z contiguous: bool, 2025-05-07T20:33:00.0618644Z compiled: bool, 2025-05-07T20:33:00.0618880Z ) -> None: 2025-05-07T20:33:00.0619114Z torch.manual_seed(2025) 2025-05-07T20:33:00.0619374Z 2025-05-07T20:33:00.0619661Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0620126Z 2025-05-07T20:33:00.0620337Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0620705Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0621040Z x = x_sign * x_clamp 2025-05-07T20:33:00.0621298Z x0 = x[:, :D] 2025-05-07T20:33:00.0621525Z x1 = x[:, D:] 2025-05-07T20:33:00.0621754Z 2025-05-07T20:33:00.0621956Z if contiguous: 2025-05-07T20:33:00.0622200Z x0 = x0.contiguous() 2025-05-07T20:33:00.0622486Z x1 = x1.contiguous() 2025-05-07T20:33:00.0622745Z 2025-05-07T20:33:00.0622946Z if scale_ub is not None: 2025-05-07T20:33:00.0623239Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0623595Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0623925Z ) 2025-05-07T20:33:00.0624123Z else: 2025-05-07T20:33:00.0624348Z scale_ub_tensor = None 2025-05-07T20:33:00.0624613Z 2025-05-07T20:33:00.0624854Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0625192Z op = silu_mul_quant 2025-05-07T20:33:00.0625462Z if compiled: 2025-05-07T20:33:00.0625722Z op = torch.compile(op) 2025-05-07T20:33:00.0626041Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0626338Z 2025-05-07T20:33:00.0626540Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.0626849Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.0627156Z 2025-05-07T20:33:00.0627428Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0627782Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.0628095Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.0628432Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.0628815Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0629142Z 2025-05-07T20:33:00.0629362Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.0629572Z 2025-05-07T20:33:00.0629686Z moe/activation_test.py:126: 2025-05-07T20:33:00.0630002Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0630360Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.0630713Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.0631537Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.0632318Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.0632897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0633622Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0634346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.0635094Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.0635915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.0636588Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.0637216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.0637760Z fn() 2025-05-07T20:33:00.0638292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.0638899Z self.fn.run( 2025-05-07T20:33:00.0639428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0639983Z kernel = self.compile( 2025-05-07T20:33:00.0640645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0641376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0641838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0642086Z 2025-05-07T20:33:00.0642304Z self = 2025-05-07T20:33:00.0643441Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0644862Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78162ad40>} 2025-05-07T20:33:00.0646255Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0647328Z context = 2025-05-07T20:33:00.0647633Z 2025-05-07T20:33:00.0647820Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0648373Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0648860Z module_map=module_map) 2025-05-07T20:33:00.0649246Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0649624Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.0649902Z E ^ 2025-05-07T20:33:00.0650393Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0650861Z 2025-05-07T20:33:00.0651304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.8504894Z 2025-05-07T20:33:00.8505768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.8506347Z self=, 2025-05-07T20:33:00.8506779Z T=4096, 2025-05-07T20:33:00.8506985Z D=5120, 2025-05-07T20:33:00.8507189Z scale_ub=None, 2025-05-07T20:33:00.8507410Z contiguous=True, 2025-05-07T20:33:00.8507650Z compiled=True, 2025-05-07T20:33:00.8507866Z ) 2025-05-07T20:33:00.8508197Z self = 2025-05-07T20:33:00.8508717Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.8509004Z 2025-05-07T20:33:00.8509096Z @given( 2025-05-07T20:33:00.8509338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.8509660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.8509986Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.8510334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.8510681Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.8511191Z ) 2025-05-07T20:33:00.8511569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.8512027Z def test_silu_mul_quant( 2025-05-07T20:33:00.8512282Z self, 2025-05-07T20:33:00.8512489Z T: int, 2025-05-07T20:33:00.8512691Z D: int, 2025-05-07T20:33:00.8512923Z scale_ub: Optional[float], 2025-05-07T20:33:00.8513212Z contiguous: bool, 2025-05-07T20:33:00.8513658Z compiled: bool, 2025-05-07T20:33:00.8513912Z ) -> None: 2025-05-07T20:33:00.8514141Z torch.manual_seed(2025) 2025-05-07T20:33:00.8514481Z 2025-05-07T20:33:00.8514766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.8515130Z 2025-05-07T20:33:00.8515337Z x_sign = torch.sign(x) 2025-05-07T20:33:00.8515642Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.8516052Z x = x_sign * x_clamp 2025-05-07T20:33:00.8516310Z x0 = x[:, :D] 2025-05-07T20:33:00.8516603Z x1 = x[:, D:] 2025-05-07T20:33:00.8516822Z 2025-05-07T20:33:00.8517016Z if contiguous: 2025-05-07T20:33:00.8517252Z x0 = x0.contiguous() 2025-05-07T20:33:00.8517525Z x1 = x1.contiguous() 2025-05-07T20:33:00.8517776Z 2025-05-07T20:33:00.8517970Z if scale_ub is not None: 2025-05-07T20:33:00.8518262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.8518615Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.8518941Z ) 2025-05-07T20:33:00.8519142Z else: 2025-05-07T20:33:00.8519363Z scale_ub_tensor = None 2025-05-07T20:33:00.8519628Z 2025-05-07T20:33:00.8519869Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8520290Z op = silu_mul_quant 2025-05-07T20:33:00.8520553Z if compiled: 2025-05-07T20:33:00.8520810Z op = torch.compile(op) 2025-05-07T20:33:00.8521124Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.8521416Z 2025-05-07T20:33:00.8521614Z y_fp8, y_scale = fn() 2025-05-07T20:33:00.8521914Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:00.8522220Z 2025-05-07T20:33:00.8522466Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.8522816Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:00.8523163Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:00.8523494Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:00.8523875Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8524199Z 2025-05-07T20:33:00.8524412Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:00.8524622Z 2025-05-07T20:33:00.8524731Z moe/activation_test.py:126: 2025-05-07T20:33:00.8525050Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.8525407Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:00.8525762Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:00.8526591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:00.8527375Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:00.8527950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.8528667Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.8529392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:00.8530144Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:00.8530913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:00.8531692Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:00.8532331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:00.8532866Z fn() 2025-05-07T20:33:00.8533400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:00.8534011Z self.fn.run( 2025-05-07T20:33:00.8534493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.8535097Z kernel = self.compile( 2025-05-07T20:33:00.8535667Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.8536353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.8536764Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.8537054Z 2025-05-07T20:33:00.8537310Z self = 2025-05-07T20:33:00.8538440Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.8539897Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7815ea660>} 2025-05-07T20:33:00.8541291Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.8542353Z context = 2025-05-07T20:33:00.8542662Z 2025-05-07T20:33:00.8542840Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.8543394Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.8543930Z module_map=module_map) 2025-05-07T20:33:00.8544313Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.8544688Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:00.8544964Z E ^ 2025-05-07T20:33:00.8545450Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:00.8546365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:33:00.8547019Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
[same test body and traceback as the T=4096 example above: ref_fn() fails in triton_quantize_fp8_row -> _kernel_quantize_fp8_row[grid] with]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:00.8593614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:00.8782116Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:00.8783754Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:00.8785149Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:00.8786196Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:00.8787364Z W0507 20:33:00.876000 228969 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
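[Editor's note: the recompile_limit warning above is a separate issue from the fp8 error. Dynamo guards on input strides, so the contiguous inputs (row stride 5120) and the non-contiguous slices of the [T, 2*D] buffer (row stride 10240) each force a fresh compile of silu_mul_quant until the budget of 8 is exhausted. A sketch of two ways to keep the compiled path alive across both layouts, assuming the import path shown in the traceback; neither is applied in this job:]

    import torch
    import torch._dynamo
    # Import path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: raise the recompile budget (the warning shows the default, 8).
    torch._dynamo.config.recompile_limit = 32

    # Option 2: compile with dynamic shapes/strides so contiguous and
    # sliced inputs can share one graph instead of re-guarding on strides.
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)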
2025-05-07T20:33:01.2775930Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
[same test body as above; this time fn() itself fails at moe/activation_test.py:117, via torch/_dynamo/eval_frame.py -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 (silu_mul_quant) -> _fbgemm_silu_mul_quant[grid], with the same CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]

[The next eight examples fail identically -- the only variation is whether fn() trips _fbgemm_silu_mul_quant[grid] or ref_fn() trips _kernel_quantize_fp8_row[grid], and the sampled parameters:]
Trying example: test_silu_mul_quant(T=1,   D=5120, scale_ub=None,   contiguous=False, compiled=True)  -- ref_fn() / _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=1,   D=5120, scale_ub=None,   contiguous=True,  compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None,   contiguous=False, compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -- fn() / _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -- fn() / _fbgemm_silu_mul_quant
[each ending with]
E   triton.compiler.errors.CompilationError: at 1:0:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:33:01.7591466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.7633056Z 2025-05-07T20:33:01.7633502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.9666365Z 2025-05-07T20:33:01.9666789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.9668082Z self=, 2025-05-07T20:33:01.9669270Z T=1, 2025-05-07T20:33:01.9669796Z D=7168, 2025-05-07T20:33:01.9670221Z scale_ub=None, 2025-05-07T20:33:01.9670669Z contiguous=False, 2025-05-07T20:33:01.9671136Z compiled=True, 2025-05-07T20:33:01.9671551Z ) 2025-05-07T20:33:01.9672213Z self = 2025-05-07T20:33:01.9673158Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.9673435Z 2025-05-07T20:33:01.9673527Z @given( 2025-05-07T20:33:01.9673772Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.9674103Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.9674422Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.9674774Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.9675129Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.9675711Z ) 2025-05-07T20:33:01.9676082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.9676549Z def test_silu_mul_quant( 2025-05-07T20:33:01.9676814Z self, 2025-05-07T20:33:01.9677019Z T: int, 2025-05-07T20:33:01.9677234Z D: int, 2025-05-07T20:33:01.9677503Z scale_ub: Optional[float], 2025-05-07T20:33:01.9677792Z contiguous: bool, 2025-05-07T20:33:01.9678051Z compiled: bool, 2025-05-07T20:33:01.9678290Z ) -> None: 2025-05-07T20:33:01.9678524Z torch.manual_seed(2025) 2025-05-07T20:33:01.9678867Z 2025-05-07T20:33:01.9679154Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.9679519Z 2025-05-07T20:33:01.9679732Z x_sign = torch.sign(x) 2025-05-07T20:33:01.9680035Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.9680566Z x = x_sign * x_clamp 2025-05-07T20:33:01.9680827Z x0 = x[:, :D] 2025-05-07T20:33:01.9681128Z x1 = x[:, D:] 2025-05-07T20:33:01.9681359Z 2025-05-07T20:33:01.9681563Z if contiguous: 2025-05-07T20:33:01.9681802Z x0 = x0.contiguous() 2025-05-07T20:33:01.9682087Z x1 = x1.contiguous() 2025-05-07T20:33:01.9682344Z 2025-05-07T20:33:01.9682544Z if scale_ub is not None: 2025-05-07T20:33:01.9682842Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.9683204Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.9683532Z ) 2025-05-07T20:33:01.9683735Z else: 2025-05-07T20:33:01.9683963Z scale_ub_tensor = None 2025-05-07T20:33:01.9684229Z 2025-05-07T20:33:01.9684474Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9684811Z op = silu_mul_quant 2025-05-07T20:33:01.9685079Z if compiled: 2025-05-07T20:33:01.9685343Z op = torch.compile(op) 2025-05-07T20:33:01.9685661Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.9685958Z 2025-05-07T20:33:01.9686159Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.9686466Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.9686779Z 2025-05-07T20:33:01.9687024Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.9687380Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.9687692Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.9688028Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.9688407Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.9688746Z 2025-05-07T20:33:01.9688963Z > y_fp8_ref, 
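The ValueError comes from Triton's CUDA backend: fp8e4nv is Triton's name for the FP8 E4M3 format, which NVIDIA GPUs expose natively only from compute capability 8.9 (Ada) and 9.0 (Hopper) onward; on older parts the backend offers only fp8e4b15 and fp8e5, exactly the two dtypes the message lists. The failure is therefore a property of the GPU architecture, not of any particular (T, D, scale_ub) draw, which is why every remaining Hypothesis example dies at the same kernel. A minimal sketch that reproduces the error independently of FBGEMM (kernel and tensor names here are illustrative, not from the log):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_to_fp8e4nv(x_ptr, y_ptr, BLOCK: tl.constexpr):
        offs = tl.arange(0, BLOCK)
        x = tl.load(x_ptr + offs)
        # On a GPU without hardware FP8 E4M3, compiling this kernel raises the
        # same CompilationError as above (pointing at the kernel's `def` line).
        tl.store(y_ptr + offs, x.to(tl.float8e4nv))

    x = torch.randn(128, device="cuda", dtype=torch.float32)
    y = torch.empty(128, device="cuda", dtype=torch.float8_e4m3fn)
    _cast_to_fp8e4nv[(1,)](x, y, BLOCK=128)

On SM 8.9+ the same launch compiles and completes; on the architecture this job ran on it fails at compile time, before any data is touched.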
2025-05-07T20:33:01.9666365Z 
2025-05-07T20:33:01.9666789Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> here fn() returned and the failure moved to the reference path: ref_fn() (moe/activation_test.py:126 -> triton_quantize_fp8_row, fp8_gemm.py:2370) hit the identical error while the autotuner benchmarked _kernel_quantize_fp8_row (autotuner.py:186 -> jit.py:623 -> compiler.py:273 -> make_ir):
    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _kernel_quantize_fp8_row(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
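That retry shows the reference path is equally blocked: triton_quantize_fp8_row compiles its own fp8e4nv kernel, so it cannot serve as an oracle on this hardware. A pure-PyTorch row-wise reference would avoid Triton entirely, since torch converts to float8_e4m3fn in software on any device. A sketch under that assumption, with the scale semantics inferred from the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (FBGEMM's actual kernel may differ in details such as eps handling):

    from typing import Optional, Tuple
    import torch

    FP8_DTYPE = torch.float8_e4m3fn
    FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max in fp32, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        row_max = torch.clamp(row_max, min=1e-12)  # guard all-zero rows
        y_scale = row_max / FP8_MAX
        y_fp8 = (y.to(torch.float32) / y_scale[:, None]).to(FP8_DTYPE)
        return y_fp8, y_scale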
2025-05-07T20:33:01.9711620Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() failed at moe/activation_test.py:117 with the same CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:02.1142463Z Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> same failure; with compiled=False the traceback is identical minus the torch/_dynamo/eval_frame.py frame
2025-05-07T20:33:02.1180891Z Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same failure
2025-05-07T20:33:02.1222609Z Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> same failure
Every example compiles the same kernel before doing any numeric work, so a capability check, sketched below, would skip them all up front.
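A sketch of such a guard; the class name is an assumption (the test's angle-bracketed repr was stripped from this log), and the real suite might equally use a pytest.mark.skipif at module level:

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv is hardware FP8 E4M3, available from compute
        # capability 8.9 (Ada) / 9.0 (Hopper) onward.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class SiluMulQuantTests(unittest.TestCase):  # illustrative name
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
        def test_silu_mul_quant(self) -> None:
            ...  # body as reproduced above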
2025-05-07T20:33:02.3103614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.3104342Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3104913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3105628Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3106330Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3106891Z kernel = self.compile( 2025-05-07T20:33:02.3107461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3108156Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3108587Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3108832Z 2025-05-07T20:33:02.3109064Z self = 2025-05-07T20:33:02.3110187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3111648Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780038720>} 2025-05-07T20:33:02.3113057Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3114534Z context = 2025-05-07T20:33:02.3114838Z 2025-05-07T20:33:02.3115024Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3115657Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3116158Z module_map=module_map) 2025-05-07T20:33:02.3116550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3116925Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3117246Z E ^ 2025-05-07T20:33:02.3117736Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3118218Z 2025-05-07T20:33:02.3118655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3119263Z 2025-05-07T20:33:02.3119375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3119820Z self=, 2025-05-07T20:33:02.3120337Z T=1, 2025-05-07T20:33:02.3120607Z D=5120, 2025-05-07T20:33:02.3120818Z scale_ub=None, 2025-05-07T20:33:02.3121051Z contiguous=False, 2025-05-07T20:33:02.3121355Z compiled=False, 2025-05-07T20:33:02.3121578Z ) 2025-05-07T20:33:02.3121914Z self = 2025-05-07T20:33:02.3122436Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:02.3122712Z 2025-05-07T20:33:02.3122801Z @given( 2025-05-07T20:33:02.3123046Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3123383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3123712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3124074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3124422Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3124730Z ) 2025-05-07T20:33:02.3125107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3125573Z def test_silu_mul_quant( 2025-05-07T20:33:02.3125844Z self, 2025-05-07T20:33:02.3126056Z T: int, 2025-05-07T20:33:02.3126264Z D: int, 2025-05-07T20:33:02.3126500Z scale_ub: Optional[float], 2025-05-07T20:33:02.3126794Z contiguous: bool, 2025-05-07T20:33:02.3127045Z compiled: bool, 2025-05-07T20:33:02.3127287Z ) -> None: 2025-05-07T20:33:02.3127520Z torch.manual_seed(2025) 2025-05-07T20:33:02.3127780Z 2025-05-07T20:33:02.3128067Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3128431Z 2025-05-07T20:33:02.3128640Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3128950Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3129282Z x = x_sign * x_clamp 2025-05-07T20:33:02.3129545Z x0 = x[:, :D] 2025-05-07T20:33:02.3129778Z x1 = x[:, D:] 2025-05-07T20:33:02.3130000Z 2025-05-07T20:33:02.3130199Z if contiguous: 2025-05-07T20:33:02.3130444Z x0 = x0.contiguous() 2025-05-07T20:33:02.3130723Z x1 = x1.contiguous() 2025-05-07T20:33:02.3130983Z 2025-05-07T20:33:02.3131182Z if scale_ub is not None: 2025-05-07T20:33:02.3131474Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3131834Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3132155Z ) 2025-05-07T20:33:02.3132362Z else: 2025-05-07T20:33:02.3132589Z scale_ub_tensor = None 2025-05-07T20:33:02.3132859Z 2025-05-07T20:33:02.3133102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3133490Z op = silu_mul_quant 2025-05-07T20:33:02.3133759Z if compiled: 2025-05-07T20:33:02.3134019Z op = torch.compile(op) 2025-05-07T20:33:02.3134334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3134641Z 2025-05-07T20:33:02.3134848Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3135024Z 2025-05-07T20:33:02.3135131Z moe/activation_test.py:117: 2025-05-07T20:33:02.3135503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3135862Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3136167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3136887Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.3137609Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3138184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3138944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3139643Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3140204Z kernel = self.compile( 2025-05-07T20:33:02.3140856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3141581Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3142008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3142251Z 2025-05-07T20:33:02.3142475Z self = 2025-05-07T20:33:02.3143604Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3145030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780039120>} 2025-05-07T20:33:02.3146431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3147495Z context = 2025-05-07T20:33:02.3147798Z 2025-05-07T20:33:02.3147982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3148534Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3149027Z module_map=module_map) 2025-05-07T20:33:02.3149415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3149792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3150063Z E ^ 2025-05-07T20:33:02.3150553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3151022Z 2025-05-07T20:33:02.3151464Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.3152003Z 2025-05-07T20:33:02.3152118Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.3152562Z self=, 2025-05-07T20:33:02.3152988Z T=4096, 2025-05-07T20:33:02.3153207Z D=7168, 2025-05-07T20:33:02.3153439Z scale_ub=1200.0, 2025-05-07T20:33:02.3153684Z contiguous=False, 2025-05-07T20:33:02.3153928Z compiled=False, 2025-05-07T20:33:02.3154140Z ) 2025-05-07T20:33:02.3154479Z self = 2025-05-07T20:33:02.3155011Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.3155300Z 2025-05-07T20:33:02.3155385Z @given( 2025-05-07T20:33:02.3155633Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.3155967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.3156291Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.3156709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.3157073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.3157380Z ) 2025-05-07T20:33:02.3157747Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.3158221Z def test_silu_mul_quant( 2025-05-07T20:33:02.3158481Z self, 2025-05-07T20:33:02.3158686Z T: int, 2025-05-07T20:33:02.3158903Z D: int, 2025-05-07T20:33:02.3159139Z scale_ub: Optional[float], 2025-05-07T20:33:02.3159428Z contiguous: bool, 2025-05-07T20:33:02.3159735Z compiled: bool, 2025-05-07T20:33:02.3159974Z ) -> None: 2025-05-07T20:33:02.3160337Z torch.manual_seed(2025) 2025-05-07T20:33:02.3160599Z 2025-05-07T20:33:02.3160894Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.3161256Z 2025-05-07T20:33:02.3161512Z x_sign = torch.sign(x) 2025-05-07T20:33:02.3161828Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.3162205Z x = x_sign * x_clamp 2025-05-07T20:33:02.3162460Z x0 = x[:, :D] 2025-05-07T20:33:02.3162693Z x1 = x[:, D:] 2025-05-07T20:33:02.3162917Z 2025-05-07T20:33:02.3163112Z if contiguous: 2025-05-07T20:33:02.3163359Z x0 = x0.contiguous() 2025-05-07T20:33:02.3163636Z x1 = x1.contiguous() 2025-05-07T20:33:02.3163885Z 2025-05-07T20:33:02.3164090Z if scale_ub is not None: 2025-05-07T20:33:02.3164385Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.3164741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.3165075Z ) 2025-05-07T20:33:02.3165284Z else: 2025-05-07T20:33:02.3165507Z scale_ub_tensor = None 2025-05-07T20:33:02.3165774Z 2025-05-07T20:33:02.3166020Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.3166355Z op = silu_mul_quant 2025-05-07T20:33:02.3166625Z if compiled: 2025-05-07T20:33:02.3166894Z op = torch.compile(op) 2025-05-07T20:33:02.3167205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3167501Z 2025-05-07T20:33:02.3167708Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.3167881Z 2025-05-07T20:33:02.3167995Z moe/activation_test.py:117: 2025-05-07T20:33:02.3168301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3168653Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.3168957Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.3169674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.3170394Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.3170963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.3171686Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.3172383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.3172942Z kernel = self.compile( 2025-05-07T20:33:02.3173513Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.3174194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.3174620Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.3174869Z 2025-05-07T20:33:02.3175087Z self = 2025-05-07T20:33:02.3176213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.3177700Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78003a480>} 2025-05-07T20:33:02.3179092Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.3180158Z context = 2025-05-07T20:33:02.3180466Z 2025-05-07T20:33:02.3180643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.3181238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.3181727Z module_map=module_map) 2025-05-07T20:33:02.3182116Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.3182533Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.3182806Z E ^ 2025-05-07T20:33:02.3183338Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.3183818Z 2025-05-07T20:33:02.3184258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.4739689Z 2025-05-07T20:33:02.4740043Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4740715Z self=, 2025-05-07T20:33:02.4741312Z T=16384, 2025-05-07T20:33:02.4741577Z D=7168, 2025-05-07T20:33:02.4741787Z scale_ub=None, 2025-05-07T20:33:02.4742018Z contiguous=True, 2025-05-07T20:33:02.4742256Z compiled=True, 2025-05-07T20:33:02.4742473Z ) 2025-05-07T20:33:02.4742809Z self = 2025-05-07T20:33:02.4743376Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:02.4743697Z 2025-05-07T20:33:02.4743797Z @given( 2025-05-07T20:33:02.4744040Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4744376Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4744706Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4745052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4745405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4745710Z ) 2025-05-07T20:33:02.4746084Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4746552Z def test_silu_mul_quant( 2025-05-07T20:33:02.4746811Z self, 2025-05-07T20:33:02.4747019Z T: int, 2025-05-07T20:33:02.4747225Z D: int, 2025-05-07T20:33:02.4747459Z scale_ub: Optional[float], 2025-05-07T20:33:02.4747748Z contiguous: bool, 2025-05-07T20:33:02.4748044Z compiled: bool, 2025-05-07T20:33:02.4748289Z ) -> None: 2025-05-07T20:33:02.4748516Z torch.manual_seed(2025) 2025-05-07T20:33:02.4748777Z 2025-05-07T20:33:02.4749068Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4749429Z 2025-05-07T20:33:02.4749639Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4749953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4750280Z x = x_sign * x_clamp 2025-05-07T20:33:02.4750541Z x0 = x[:, :D] 2025-05-07T20:33:02.4750783Z x1 = x[:, D:] 2025-05-07T20:33:02.4751000Z 2025-05-07T20:33:02.4751204Z if contiguous: 2025-05-07T20:33:02.4751455Z x0 = x0.contiguous() 2025-05-07T20:33:02.4751728Z x1 = x1.contiguous() 2025-05-07T20:33:02.4751993Z 2025-05-07T20:33:02.4752201Z if scale_ub is not None: 2025-05-07T20:33:02.4752488Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4752853Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4753481Z ) 2025-05-07T20:33:02.4753699Z else: 2025-05-07T20:33:02.4753921Z scale_ub_tensor = None 2025-05-07T20:33:02.4754190Z 2025-05-07T20:33:02.4754442Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4754773Z op = silu_mul_quant 2025-05-07T20:33:02.4755043Z if compiled: 2025-05-07T20:33:02.4755311Z op = torch.compile(op) 2025-05-07T20:33:02.4755624Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4755920Z 2025-05-07T20:33:02.4756126Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4756398Z 2025-05-07T20:33:02.4756506Z moe/activation_test.py:117: 2025-05-07T20:33:02.4756822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4757178Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4757480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4758151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.4758822Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.4759522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4760346Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4760919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4761645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4762352Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4762910Z kernel = self.compile( 2025-05-07T20:33:02.4763482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4764179Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4764607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4764857Z 2025-05-07T20:33:02.4765077Z self = 2025-05-07T20:33:02.4766209Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4767667Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff78003b740>} 2025-05-07T20:33:02.4769074Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4770150Z context = 2025-05-07T20:33:02.4770464Z 2025-05-07T20:33:02.4770644Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4771199Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4771692Z module_map=module_map) 2025-05-07T20:33:02.4772074Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4772465Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4780546Z E ^ 2025-05-07T20:33:02.4781057Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.4781535Z 2025-05-07T20:33:02.4781989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.4782536Z 2025-05-07T20:33:02.4782644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.4783178Z self=, 2025-05-07T20:33:02.4783608Z T=4096, 2025-05-07T20:33:02.4783803Z D=5120, 2025-05-07T20:33:02.4784006Z scale_ub=None, 2025-05-07T20:33:02.4784234Z contiguous=False, 2025-05-07T20:33:02.4784472Z compiled=True, 2025-05-07T20:33:02.4784682Z ) 2025-05-07T20:33:02.4785022Z self = 2025-05-07T20:33:02.4785544Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:02.4785826Z 2025-05-07T20:33:02.4785952Z @given( 2025-05-07T20:33:02.4786194Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.4786524Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.4786837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.4787185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.4787574Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.4787869Z ) 2025-05-07T20:33:02.4788279Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.4788745Z def test_silu_mul_quant( 2025-05-07T20:33:02.4789000Z self, 2025-05-07T20:33:02.4789198Z T: int, 2025-05-07T20:33:02.4789407Z D: int, 2025-05-07T20:33:02.4789636Z scale_ub: Optional[float], 2025-05-07T20:33:02.4789914Z contiguous: bool, 2025-05-07T20:33:02.4790167Z compiled: bool, 2025-05-07T20:33:02.4790402Z ) -> None: 2025-05-07T20:33:02.4790620Z torch.manual_seed(2025) 2025-05-07T20:33:02.4790876Z 2025-05-07T20:33:02.4791162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.4791513Z 2025-05-07T20:33:02.4791714Z x_sign = torch.sign(x) 2025-05-07T20:33:02.4792019Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.4792339Z x = x_sign * x_clamp 2025-05-07T20:33:02.4792588Z x0 = x[:, :D] 2025-05-07T20:33:02.4792818Z x1 = x[:, D:] 2025-05-07T20:33:02.4793031Z 2025-05-07T20:33:02.4793231Z if contiguous: 2025-05-07T20:33:02.4793472Z x0 = x0.contiguous() 2025-05-07T20:33:02.4793746Z x1 = x1.contiguous() 2025-05-07T20:33:02.4793993Z 2025-05-07T20:33:02.4794192Z if scale_ub is not None: 2025-05-07T20:33:02.4794476Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.4794822Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.4795147Z ) 2025-05-07T20:33:02.4795357Z else: 2025-05-07T20:33:02.4795567Z scale_ub_tensor = None 2025-05-07T20:33:02.4795817Z 2025-05-07T20:33:02.4796063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.4796389Z op = silu_mul_quant 2025-05-07T20:33:02.4796653Z if compiled: 2025-05-07T20:33:02.4796914Z op = torch.compile(op) 2025-05-07T20:33:02.4797226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4797520Z 2025-05-07T20:33:02.4797728Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.4797899Z 2025-05-07T20:33:02.4798010Z moe/activation_test.py:117: 2025-05-07T20:33:02.4798312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4798659Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.4798945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.4799522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:02.4800212Z return fn(*args, **kwargs) 
2025-05-07T20:33:02.4800897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:02.4801609Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.4802175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.4802956Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.4803651Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.4804199Z kernel = self.compile( 2025-05-07T20:33:02.4804762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.4805446Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.4805861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.4806142Z 2025-05-07T20:33:02.4806356Z self = 2025-05-07T20:33:02.4807476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.4808980Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780678c20>} 2025-05-07T20:33:02.4810366Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.4811414Z context = 2025-05-07T20:33:02.4811722Z 2025-05-07T20:33:02.4811894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.4812436Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.4812922Z module_map=module_map) 2025-05-07T20:33:02.4813585Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.4814556Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.4814859Z E ^ 2025-05-07T20:33:02.4815357Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.4815837Z 2025-05-07T20:33:02.4816284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:02.6184531Z 2025-05-07T20:33:02.6184774Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:02.6185404Z self=, 2025-05-07T20:33:02.6185965Z T=4096, 2025-05-07T20:33:02.6186165Z D=5120, 2025-05-07T20:33:02.6186376Z scale_ub=1200.0, 2025-05-07T20:33:02.6186621Z contiguous=False, 2025-05-07T20:33:02.6186900Z compiled=False, 2025-05-07T20:33:02.6187120Z ) 2025-05-07T20:33:02.6187464Z self = 2025-05-07T20:33:02.6188013Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:02.6188303Z 2025-05-07T20:33:02.6188392Z @given( 2025-05-07T20:33:02.6188634Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:02.6188976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:02.6189306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:02.6189655Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:02.6190011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:02.6190320Z ) 2025-05-07T20:33:02.6190693Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:02.6191166Z def test_silu_mul_quant( 2025-05-07T20:33:02.6191431Z self, 2025-05-07T20:33:02.6191635Z T: int, 2025-05-07T20:33:02.6191851Z D: int, 2025-05-07T20:33:02.6192087Z scale_ub: Optional[float], 2025-05-07T20:33:02.6192385Z contiguous: bool, 2025-05-07T20:33:02.6192792Z compiled: bool, 2025-05-07T20:33:02.6193043Z ) -> None: 2025-05-07T20:33:02.6193279Z torch.manual_seed(2025) 2025-05-07T20:33:02.6193531Z 2025-05-07T20:33:02.6193822Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:02.6194187Z 2025-05-07T20:33:02.6194393Z x_sign = torch.sign(x) 2025-05-07T20:33:02.6194707Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:02.6195041Z x = x_sign * x_clamp 2025-05-07T20:33:02.6195292Z x0 = x[:, :D] 2025-05-07T20:33:02.6195526Z x1 = x[:, D:] 2025-05-07T20:33:02.6195827Z 2025-05-07T20:33:02.6196022Z if contiguous: 2025-05-07T20:33:02.6196274Z x0 = x0.contiguous() 2025-05-07T20:33:02.6196553Z x1 = x1.contiguous() 2025-05-07T20:33:02.6196803Z 2025-05-07T20:33:02.6197010Z if scale_ub is not None: 2025-05-07T20:33:02.6197374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:02.6197738Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:02.6198127Z ) 2025-05-07T20:33:02.6198338Z else: 2025-05-07T20:33:02.6198567Z scale_ub_tensor = None 2025-05-07T20:33:02.6198827Z 2025-05-07T20:33:02.6199077Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:02.6199415Z op = silu_mul_quant 2025-05-07T20:33:02.6199678Z if compiled: 2025-05-07T20:33:02.6199942Z op = torch.compile(op) 2025-05-07T20:33:02.6200364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6200655Z 2025-05-07T20:33:02.6200864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:02.6201036Z 2025-05-07T20:33:02.6201150Z moe/activation_test.py:117: 2025-05-07T20:33:02.6201461Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6201820Z moe/activation_test.py:115: in fn 2025-05-07T20:33:02.6202127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:02.6202865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:02.6203589Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:02.6204160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:02.6204884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:02.6205575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:02.6206143Z kernel = self.compile( 2025-05-07T20:33:02.6206716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:02.6207407Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:02.6207825Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:02.6208076Z 2025-05-07T20:33:02.6208297Z self = 2025-05-07T20:33:02.6209433Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:02.6210890Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7806796c0>} 2025-05-07T20:33:02.6212299Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:02.6213535Z context = 2025-05-07T20:33:02.6213855Z 2025-05-07T20:33:02.6214107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:02.6214667Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:02.6215165Z module_map=module_map) 2025-05-07T20:33:02.6215548Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:02.6215924Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:02.6216198Z E ^ 2025-05-07T20:33:02.6216682Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:02.6217663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
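Every Hypothesis example below dies in the same place: Triton's frontend rejects the fp8e4nv (float8_e4m3fn) element type while lowering _fbgemm_silu_mul_quant. fp8e4nv math generally requires an NVIDIA GPU of compute capability 8.9 or newer (Ada/Hopper); on older parts such as an sm_86-class card, Triton only offers fp8e4b15 and fp8e5, which matches the error text. Because the kernel is JIT-compiled at first launch, the eager and torch.compile paths hit the identical CompilationError. A minimal probe for this, sketched against public torch APIs (the (8, 9) threshold is background knowledge, not something this log states):

    # Sketch: check whether the current CUDA device can compile fp8e4nv
    # Triton kernels. Assumption: fp8e4nv needs compute capability >= (8, 9);
    # older architectures raise the ValueError shown above.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        # (major, minor), e.g. (8, 6) for an A10G, (9, 0) for an H100.
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print("fp8e4nv supported:", supports_fp8e4nv())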
2025-05-07T20:33:02.6218318Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:02.6252087Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.8250668Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.8291232Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9645696Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9682938Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:02.9717010Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.3834821Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.3867400Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) ... same CompilationError (fp8e4nv not supported)
2025-05-07T20:33:03.5440399Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError (fp8e4nv not supported)
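Since the failure is an unconditional property of the GPU architecture, a guard that skips the test on unsupported devices would avoid burning every generated example on the same error. A hypothetical sketch (the helper name, threshold, and decorator placement are illustrative assumptions, not the repository's actual fix):

    # Hypothetical skip guard for fp8e4nv-dependent tests.
    # _has_fp8e4nv and its (8, 9) threshold are assumptions for illustration.
    import unittest
    import torch

    def _has_fp8e4nv() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipUnless(_has_fp8e4nv(), "Triton fp8e4nv requires sm_89 or newer")
        def test_silu_mul_quant_guarded(self) -> None:
            # The Hypothesis-driven body from the log would run here unchanged.
            self.assertTrue(torch.cuda.is_available())

    if __name__ == "__main__":
        unittest.main()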
Hypothesis continues drawing examples; each of the following fails with the identical CompilationError ("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") and an identical traceback from triton/compiler/compiler.py:100, differing only in the drawn parameters:

2025-05-07T20:33:03.5475065Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:03.7119202Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.7161044Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:03.8885574Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:03.8920589Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:04.0116349Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:04.0157875Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:04.0194404Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:04.1838600Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:04.1872302Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
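For context on what the test exercises: silu_mul_quant fuses SiLU(x0) * x1 with quantization of the result to fp8, returning the quantized tensor and its scale. A rough eager-mode sketch of that computation, assuming rowwise e4m3 max-abs scaling with the optional scale_ub acting as a cap on the row maximum (the exact scaling scheme inside FBGEMM's kernel may differ; silu_mul_quant_ref and FP8_MAX are illustrative names, not FBGEMM API):

    from typing import Optional, Tuple
    import torch

    FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU(x0) * x1, computed in fp32 before quantizing.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        # Rowwise max-abs scaling; scale_ub (if given) caps the row max.
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(1)

This sketch also suggests why the test passes scale_ub as a float32 tensor on the device rather than a Python float: the bound is presumably read by the kernel at runtime instead of being baked into the compiled code.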
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.3204375Z 2025-05-07T20:33:04.3204810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.3205348Z 2025-05-07T20:33:04.3205455Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.3205888Z self=, 2025-05-07T20:33:04.3206311Z T=16384, 2025-05-07T20:33:04.3206508Z D=5120, 2025-05-07T20:33:04.3206728Z scale_ub=None, 2025-05-07T20:33:04.3206956Z contiguous=False, 2025-05-07T20:33:04.3207192Z compiled=False, 2025-05-07T20:33:04.3207409Z ) 2025-05-07T20:33:04.3207745Z self = 2025-05-07T20:33:04.3208268Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.3208575Z 2025-05-07T20:33:04.3208657Z @given( 2025-05-07T20:33:04.3208908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.3217715Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.3218086Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.3218439Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.3218777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.3219077Z ) 2025-05-07T20:33:04.3219449Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.3219914Z def test_silu_mul_quant( 2025-05-07T20:33:04.3220171Z self, 2025-05-07T20:33:04.3220376Z T: int, 2025-05-07T20:33:04.3220576Z D: int, 2025-05-07T20:33:04.3220809Z scale_ub: Optional[float], 2025-05-07T20:33:04.3221220Z contiguous: bool, 2025-05-07T20:33:04.3221466Z compiled: bool, 2025-05-07T20:33:04.3221705Z ) -> None: 2025-05-07T20:33:04.3221938Z torch.manual_seed(2025) 2025-05-07T20:33:04.3222254Z 2025-05-07T20:33:04.3222551Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.3222917Z 2025-05-07T20:33:04.3223133Z x_sign = torch.sign(x) 2025-05-07T20:33:04.3223439Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.3225552Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.61 GiB allocated by PyTorch.

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free; 21.50 GiB allocated by PyTorch.

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB allocated by PyTorch.

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       x_sign = torch.sign(x)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free; 21.67 GiB allocated by PyTorch.

moe/activation_test.py:94: OutOfMemoryError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
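Every one of these compile failures is raised at the same spot, so pinning a single case makes for a quick local repro without rerunning the whole hypothesis search. A sketch using hypothesis's @example decorator (test body elided; only T and D shown for brevity):

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @example(T=1, D=7168)  # the failing case above; explicit examples run first
    @settings(deadline=None)
    def test_repro(T: int, D: int) -> None:
        ...  # build x0/x1 and call silu_mul_quant as in test_silu_mul_quant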
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free; 21.69 GiB allocated by PyTorch.

moe/activation_test.py:92: OutOfMemoryError
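All of the OOM failures carry the allocator's own hint about fragmentation, and the examples run back to back in a single process. A sketch of the two mitigations that follow from that (the environment variable comes from the error text; the cleanup helper is illustrative, not part of the test suite):

    import os
    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc
    import torch

    def reset_cuda_pool() -> None:
        # Drop references left over from a previous example, then return cached
        # segments to the driver so the next example starts with more headroom.
        gc.collect()
        torch.cuda.empty_cache()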
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)

>       y_fp8, y_scale = fn()
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
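For reference, the contract the test exercises is visible from the listing above: silu_mul_quant(x0, x1, scale_ub) returns a quantized tensor plus a scale. A rough eager-mode sketch of that contract, assuming SiLU(x0) * x1 followed by rowwise FP8 e4m3 quantization with an optional cap on the scale source; this is an inference from the test, not the actual fused Triton kernel:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)  # honor the upper bound
        scale = row_max.clamp(min=1e-12) / FP8_MAX      # one scale per row
        y_fp8 = (y / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale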
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.5807809Z 2025-05-07T20:33:04.5808250Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.6624698Z 2025-05-07T20:33:04.6625272Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6625923Z self=, 2025-05-07T20:33:04.6626453Z T=2048, 2025-05-07T20:33:04.6626658Z D=5120, 2025-05-07T20:33:04.6626863Z scale_ub=None, 2025-05-07T20:33:04.6627374Z contiguous=True, 2025-05-07T20:33:04.6627606Z compiled=False, 2025-05-07T20:33:04.6627826Z ) 2025-05-07T20:33:04.6628174Z self = 2025-05-07T20:33:04.6628787Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6629083Z 2025-05-07T20:33:04.6629166Z @given( 2025-05-07T20:33:04.6629413Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6629746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6630067Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6630417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6630766Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6631063Z ) 2025-05-07T20:33:04.6631434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6631900Z def test_silu_mul_quant( 2025-05-07T20:33:04.6632152Z self, 2025-05-07T20:33:04.6632359Z T: int, 2025-05-07T20:33:04.6632568Z D: int, 2025-05-07T20:33:04.6632884Z scale_ub: Optional[float], 2025-05-07T20:33:04.6633167Z contiguous: bool, 2025-05-07T20:33:04.6633425Z compiled: bool, 2025-05-07T20:33:04.6633752Z ) -> None: 2025-05-07T20:33:04.6633976Z torch.manual_seed(2025) 2025-05-07T20:33:04.6634232Z 2025-05-07T20:33:04.6634519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6634874Z 2025-05-07T20:33:04.6635081Z > x_sign = torch.sign(x) 2025-05-07T20:33:04.6637111Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6639058Z 2025-05-07T20:33:04.6639192Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:04.6639415Z 2025-05-07T20:33:04.6639531Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6639966Z self=, 2025-05-07T20:33:04.6640511Z T=16384, 2025-05-07T20:33:04.6640719Z D=5120, 2025-05-07T20:33:04.6640928Z scale_ub=None, 2025-05-07T20:33:04.6641148Z contiguous=True, 2025-05-07T20:33:04.6641389Z compiled=False, 2025-05-07T20:33:04.6641605Z ) 2025-05-07T20:33:04.6641967Z self = 2025-05-07T20:33:04.6642492Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6642786Z 2025-05-07T20:33:04.6642874Z @given( 2025-05-07T20:33:04.6643118Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6643456Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6643777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6644133Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6644486Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6644788Z ) 2025-05-07T20:33:04.6645165Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6645629Z def test_silu_mul_quant( 2025-05-07T20:33:04.6645886Z self, 2025-05-07T20:33:04.6646086Z T: int, 2025-05-07T20:33:04.6646296Z D: int, 2025-05-07T20:33:04.6646525Z scale_ub: Optional[float], 2025-05-07T20:33:04.6646807Z contiguous: bool, 2025-05-07T20:33:04.6647069Z compiled: bool, 2025-05-07T20:33:04.6647303Z ) -> None: 2025-05-07T20:33:04.6647527Z torch.manual_seed(2025) 2025-05-07T20:33:04.6647869Z 2025-05-07T20:33:04.6648159Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6650334Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6652259Z 2025-05-07T20:33:04.6652389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6652616Z 2025-05-07T20:33:04.6652724Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6653160Z self=, 2025-05-07T20:33:04.6653589Z T=4096, 2025-05-07T20:33:04.6653784Z D=5120, 2025-05-07T20:33:04.6654034Z scale_ub=None, 2025-05-07T20:33:04.6654263Z contiguous=True, 2025-05-07T20:33:04.6654502Z compiled=False, 2025-05-07T20:33:04.6654758Z ) 2025-05-07T20:33:04.6655091Z self = 2025-05-07T20:33:04.6655610Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.6655903Z 2025-05-07T20:33:04.6655989Z @given( 2025-05-07T20:33:04.6656232Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6656569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6656887Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6657242Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6657590Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6657886Z ) 2025-05-07T20:33:04.6658256Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6658734Z def test_silu_mul_quant( 2025-05-07T20:33:04.6658991Z self, 2025-05-07T20:33:04.6659202Z T: int, 2025-05-07T20:33:04.6659413Z D: int, 2025-05-07T20:33:04.6659648Z scale_ub: Optional[float], 2025-05-07T20:33:04.6659941Z contiguous: bool, 2025-05-07T20:33:04.6660196Z compiled: bool, 2025-05-07T20:33:04.6660428Z ) -> None: 2025-05-07T20:33:04.6660660Z torch.manual_seed(2025) 2025-05-07T20:33:04.6660919Z 2025-05-07T20:33:04.6661213Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6663330Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6665263Z 2025-05-07T20:33:04.6665389Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6665633Z 2025-05-07T20:33:04.6665742Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6666190Z self=, 2025-05-07T20:33:04.6666612Z T=2048, 2025-05-07T20:33:04.6666817Z D=5120, 2025-05-07T20:33:04.6667027Z scale_ub=None, 2025-05-07T20:33:04.6667259Z contiguous=False, 2025-05-07T20:33:04.6667495Z compiled=False, 2025-05-07T20:33:04.6667720Z ) 2025-05-07T20:33:04.6668055Z self = 2025-05-07T20:33:04.6668571Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.6668913Z 2025-05-07T20:33:04.6668995Z @given( 2025-05-07T20:33:04.6669239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6669601Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6669927Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6670273Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6670612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6670914Z ) 2025-05-07T20:33:04.6671283Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6671747Z def test_silu_mul_quant( 2025-05-07T20:33:04.6671997Z self, 2025-05-07T20:33:04.6672203Z T: int, 2025-05-07T20:33:04.6672411Z D: int, 2025-05-07T20:33:04.6672636Z scale_ub: Optional[float], 2025-05-07T20:33:04.6672921Z contiguous: bool, 2025-05-07T20:33:04.6673174Z compiled: bool, 2025-05-07T20:33:04.6673407Z ) -> None: 2025-05-07T20:33:04.6673636Z torch.manual_seed(2025) 2025-05-07T20:33:04.6673919Z 2025-05-07T20:33:04.6674272Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6676467Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6678392Z 2025-05-07T20:33:04.6678516Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6678743Z 2025-05-07T20:33:04.6678853Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6679291Z self=, 2025-05-07T20:33:04.6679710Z T=4096, 2025-05-07T20:33:04.6679912Z D=7168, 2025-05-07T20:33:04.6680188Z scale_ub=None, 2025-05-07T20:33:04.6680411Z contiguous=True, 2025-05-07T20:33:04.6680648Z compiled=True, 2025-05-07T20:33:04.6680864Z ) 2025-05-07T20:33:04.6681192Z self = 2025-05-07T20:33:04.6681713Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:04.6681998Z 2025-05-07T20:33:04.6682078Z @given( 2025-05-07T20:33:04.6682320Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.6682649Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.6682973Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.6683322Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.6683663Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.6683969Z ) 2025-05-07T20:33:04.6684339Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.6684802Z def test_silu_mul_quant( 2025-05-07T20:33:04.6685061Z self, 2025-05-07T20:33:04.6685271Z T: int, 2025-05-07T20:33:04.6685473Z D: int, 2025-05-07T20:33:04.6685707Z scale_ub: Optional[float], 2025-05-07T20:33:04.6685992Z contiguous: bool, 2025-05-07T20:33:04.6686239Z compiled: bool, 2025-05-07T20:33:04.6686472Z ) -> None: 2025-05-07T20:33:04.6686699Z torch.manual_seed(2025) 2025-05-07T20:33:04.6686959Z 2025-05-07T20:33:04.6687239Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.6689421Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.6691394Z 2025-05-07T20:33:04.6691519Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.6691743Z 2025-05-07T20:33:04.6691857Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.6692288Z self=, 2025-05-07T20:33:04.6692710Z T=2048, 2025-05-07T20:33:04.6692908Z D=5120, 2025-05-07T20:33:04.6693113Z scale_ub=1200.0, 2025-05-07T20:33:04.6693345Z contiguous=False, 2025-05-07T20:33:04.6693585Z compiled=False, 2025-05-07T20:33:04.7245335Z ) 2025-05-07T20:33:04.7245871Z self = 2025-05-07T20:33:04.7246601Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.7247217Z 2025-05-07T20:33:04.7247347Z @given( 2025-05-07T20:33:04.7247679Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7248195Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7248526Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7248872Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7249227Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7249533Z ) 2025-05-07T20:33:04.7249896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7250363Z def test_silu_mul_quant( 2025-05-07T20:33:04.7250622Z self, 2025-05-07T20:33:04.7250829Z T: int, 2025-05-07T20:33:04.7251034Z D: int, 2025-05-07T20:33:04.7251267Z scale_ub: Optional[float], 2025-05-07T20:33:04.7251555Z contiguous: bool, 2025-05-07T20:33:04.7251811Z compiled: bool, 2025-05-07T20:33:04.7252052Z ) -> None: 2025-05-07T20:33:04.7252287Z torch.manual_seed(2025) 2025-05-07T20:33:04.7252577Z 2025-05-07T20:33:04.7252871Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7254997Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7256933Z 2025-05-07T20:33:04.7257059Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7257291Z 2025-05-07T20:33:04.7257400Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7257843Z self=, 2025-05-07T20:33:04.7258266Z T=4096, 2025-05-07T20:33:04.7258466Z D=7168, 2025-05-07T20:33:04.7258675Z scale_ub=1200.0, 2025-05-07T20:33:04.7258910Z contiguous=True, 2025-05-07T20:33:04.7259141Z compiled=False, 2025-05-07T20:33:04.7259358Z ) 2025-05-07T20:33:04.7259699Z self = 2025-05-07T20:33:04.7260219Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.7260515Z 2025-05-07T20:33:04.7260599Z @given( 2025-05-07T20:33:04.7260841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7261173Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7261502Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7261852Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7262282Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7262590Z ) 2025-05-07T20:33:04.7263028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7263499Z def test_silu_mul_quant( 2025-05-07T20:33:04.7263756Z self, 2025-05-07T20:33:04.7263964Z T: int, 2025-05-07T20:33:04.7264175Z D: int, 2025-05-07T20:33:04.7264414Z scale_ub: Optional[float], 2025-05-07T20:33:04.7264703Z contiguous: bool, 2025-05-07T20:33:04.7264953Z compiled: bool, 2025-05-07T20:33:04.7265191Z ) -> None: 2025-05-07T20:33:04.7265423Z torch.manual_seed(2025) 2025-05-07T20:33:04.7265675Z 2025-05-07T20:33:04.7265970Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7268146Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7270108Z 2025-05-07T20:33:04.7270240Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7270465Z 2025-05-07T20:33:04.7270575Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7271015Z self=, 2025-05-07T20:33:04.7271439Z T=16384, 2025-05-07T20:33:04.7271645Z D=7168, 2025-05-07T20:33:04.7271916Z scale_ub=None, 2025-05-07T20:33:04.7272266Z contiguous=False, 2025-05-07T20:33:04.7272714Z compiled=True, 2025-05-07T20:33:04.7273044Z ) 2025-05-07T20:33:04.7273465Z self = 2025-05-07T20:33:04.7282332Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:04.7282644Z 2025-05-07T20:33:04.7282736Z @given( 2025-05-07T20:33:04.7282979Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7283313Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7283640Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7283984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7284332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7284634Z ) 2025-05-07T20:33:04.7285005Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7285466Z def test_silu_mul_quant( 2025-05-07T20:33:04.7285722Z self, 2025-05-07T20:33:04.7285926Z T: int, 2025-05-07T20:33:04.7286126Z D: int, 2025-05-07T20:33:04.7286354Z scale_ub: Optional[float], 2025-05-07T20:33:04.7286641Z contiguous: bool, 2025-05-07T20:33:04.7286886Z compiled: bool, 2025-05-07T20:33:04.7287124Z ) -> None: 2025-05-07T20:33:04.7287352Z torch.manual_seed(2025) 2025-05-07T20:33:04.7287598Z 2025-05-07T20:33:04.7287894Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7290043Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7291986Z 2025-05-07T20:33:04.7292190Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7292411Z 2025-05-07T20:33:04.7292526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7293001Z self=, 2025-05-07T20:33:04.7293429Z T=4096, 2025-05-07T20:33:04.7293627Z D=7168, 2025-05-07T20:33:04.7293823Z scale_ub=None, 2025-05-07T20:33:04.7294053Z contiguous=True, 2025-05-07T20:33:04.7294289Z compiled=False, 2025-05-07T20:33:04.7294498Z ) 2025-05-07T20:33:04.7294839Z self = 2025-05-07T20:33:04.7295363Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.7295646Z 2025-05-07T20:33:04.7295735Z @given( 2025-05-07T20:33:04.7295972Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7296301Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7296625Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7296970Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7297365Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7297668Z ) 2025-05-07T20:33:04.7298073Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7298538Z def test_silu_mul_quant( 2025-05-07T20:33:04.7298794Z self, 2025-05-07T20:33:04.7298992Z T: int, 2025-05-07T20:33:04.7299203Z D: int, 2025-05-07T20:33:04.7299434Z scale_ub: Optional[float], 2025-05-07T20:33:04.7299712Z contiguous: bool, 2025-05-07T20:33:04.7299967Z compiled: bool, 2025-05-07T20:33:04.7300203Z ) -> None: 2025-05-07T20:33:04.7300433Z torch.manual_seed(2025) 2025-05-07T20:33:04.7300681Z 2025-05-07T20:33:04.7300967Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7303088Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7305017Z 2025-05-07T20:33:04.7305146Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7305366Z 2025-05-07T20:33:04.7305474Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7305908Z self=, 2025-05-07T20:33:04.7306327Z T=16384, 2025-05-07T20:33:04.7306530Z D=7168, 2025-05-07T20:33:04.7306724Z scale_ub=None, 2025-05-07T20:33:04.7306949Z contiguous=True, 2025-05-07T20:33:04.7307184Z compiled=False, 2025-05-07T20:33:04.7307392Z ) 2025-05-07T20:33:04.7307723Z self = 2025-05-07T20:33:04.7308252Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:04.7308544Z 2025-05-07T20:33:04.7308625Z @given( 2025-05-07T20:33:04.7308866Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7309191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7309504Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7309850Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7310195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7310495Z ) 2025-05-07T20:33:04.7310852Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7311311Z def test_silu_mul_quant( 2025-05-07T20:33:04.7311566Z self, 2025-05-07T20:33:04.7311763Z T: int, 2025-05-07T20:33:04.7312024Z D: int, 2025-05-07T20:33:04.7312259Z scale_ub: Optional[float], 2025-05-07T20:33:04.7312543Z contiguous: bool, 2025-05-07T20:33:04.7312796Z compiled: bool, 2025-05-07T20:33:04.7313069Z ) -> None: 2025-05-07T20:33:04.7313293Z torch.manual_seed(2025) 2025-05-07T20:33:04.7313963Z 2025-05-07T20:33:04.7314300Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7316453Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7318498Z 2025-05-07T20:33:04.7318630Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.7318976Z 2025-05-07T20:33:04.7319088Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.7319585Z self=, 2025-05-07T20:33:04.7320009Z T=16384, 2025-05-07T20:33:04.7320299Z D=7168, 2025-05-07T20:33:04.7320504Z scale_ub=1200.0, 2025-05-07T20:33:04.7320739Z contiguous=True, 2025-05-07T20:33:04.7320964Z compiled=False, 2025-05-07T20:33:04.7321181Z ) 2025-05-07T20:33:04.7321517Z self = 2025-05-07T20:33:04.7322031Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:04.7322333Z 2025-05-07T20:33:04.7322415Z @given( 2025-05-07T20:33:04.7322656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.7322983Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.7323302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.7323652Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.7324004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.7324299Z ) 2025-05-07T20:33:04.7324664Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.7325124Z def test_silu_mul_quant( 2025-05-07T20:33:04.7325371Z self, 2025-05-07T20:33:04.7325576Z T: int, 2025-05-07T20:33:04.7325782Z D: int, 2025-05-07T20:33:04.7326002Z scale_ub: Optional[float], 2025-05-07T20:33:04.7326290Z contiguous: bool, 2025-05-07T20:33:04.7326544Z compiled: bool, 2025-05-07T20:33:04.7326779Z ) -> None: 2025-05-07T20:33:04.7326995Z torch.manual_seed(2025) 2025-05-07T20:33:04.7327247Z 2025-05-07T20:33:04.7327533Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.7329649Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
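[annotation] A 22.07 GiB card failing a 56 MiB request means tensors from earlier examples (plus the allocator's cache) are still holding the GPU. One hedged mitigation, assuming the suite can afford a cleanup hook per example, is to drop dead references and return cached blocks between runs; release_cuda_memory below is an illustrative helper, not an FBGEMM API:

    # Sketch: release cached CUDA blocks between Hypothesis examples.
    import gc

    import torch

    def release_cuda_memory() -> None:
        gc.collect()                  # drop dead Python references to tensors
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling this at the top of the test body (or from setUp) keeps one example's residue from starving the next; it cannot fix a genuine leak, only caching pressure.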
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.7331573Z 2025-05-07T20:33:04.7331694Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.9144238Z 2025-05-07T20:33:04.9145058Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9145754Z self=, 2025-05-07T20:33:04.9146368Z T=128, 2025-05-07T20:33:04.9146643Z D=5120, 2025-05-07T20:33:04.9146929Z scale_ub=1200.0, 2025-05-07T20:33:04.9147514Z contiguous=False, 2025-05-07T20:33:04.9147755Z compiled=False, 2025-05-07T20:33:04.9147983Z ) 2025-05-07T20:33:04.9148403Z self = 2025-05-07T20:33:04.9148934Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:04.9149230Z 2025-05-07T20:33:04.9149312Z @given( 2025-05-07T20:33:04.9149557Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9149883Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9150207Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9150557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9150899Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9151202Z ) 2025-05-07T20:33:04.9151572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9152030Z def test_silu_mul_quant( 2025-05-07T20:33:04.9152288Z self, 2025-05-07T20:33:04.9152500Z T: int, 2025-05-07T20:33:04.9152700Z D: int, 2025-05-07T20:33:04.9153058Z scale_ub: Optional[float], 2025-05-07T20:33:04.9153358Z contiguous: bool, 2025-05-07T20:33:04.9153699Z compiled: bool, 2025-05-07T20:33:04.9153935Z ) -> None: 2025-05-07T20:33:04.9154194Z torch.manual_seed(2025) 2025-05-07T20:33:04.9154440Z 2025-05-07T20:33:04.9154725Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9155082Z 2025-05-07T20:33:04.9155280Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9155588Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9155915Z x = x_sign * x_clamp 2025-05-07T20:33:04.9156160Z x0 = x[:, :D] 2025-05-07T20:33:04.9156392Z x1 = x[:, D:] 2025-05-07T20:33:04.9156617Z 2025-05-07T20:33:04.9156805Z if contiguous: 2025-05-07T20:33:04.9157047Z x0 = x0.contiguous() 2025-05-07T20:33:04.9157321Z x1 = x1.contiguous() 2025-05-07T20:33:04.9157564Z 2025-05-07T20:33:04.9157767Z if scale_ub is not None: 2025-05-07T20:33:04.9158054Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9158405Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9158723Z ) 2025-05-07T20:33:04.9158925Z else: 2025-05-07T20:33:04.9159142Z scale_ub_tensor = None 2025-05-07T20:33:04.9159397Z 2025-05-07T20:33:04.9159637Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9159966Z op = silu_mul_quant 2025-05-07T20:33:04.9160357Z if compiled: 2025-05-07T20:33:04.9160615Z op = torch.compile(op) 2025-05-07T20:33:04.9160924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9161205Z 2025-05-07T20:33:04.9161410Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9161579Z 2025-05-07T20:33:04.9161687Z moe/activation_test.py:117: 2025-05-07T20:33:04.9162008Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9162361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9162652Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9163380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9164150Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9164714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9165432Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9166124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9166685Z kernel = self.compile( 2025-05-07T20:33:04.9167242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9167979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9168434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9168675Z 2025-05-07T20:33:04.9168894Z self = 2025-05-07T20:33:04.9170018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9171484Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf7e11c0>} 2025-05-07T20:33:04.9172878Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9173949Z context = 2025-05-07T20:33:04.9174297Z 2025-05-07T20:33:04.9174510Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9175070Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9175559Z module_map=module_map) 2025-05-07T20:33:04.9175936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9176298Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9176568Z E ^ 2025-05-07T20:33:04.9177049Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9177523Z 2025-05-07T20:33:04.9177956Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:04.9178500Z 2025-05-07T20:33:04.9178608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9179048Z self=, 2025-05-07T20:33:04.9179472Z T=2048, 2025-05-07T20:33:04.9179666Z D=7168, 2025-05-07T20:33:04.9179868Z scale_ub=None, 2025-05-07T20:33:04.9180093Z contiguous=False, 2025-05-07T20:33:04.9180324Z compiled=False, 2025-05-07T20:33:04.9180538Z ) 2025-05-07T20:33:04.9180873Z self = 2025-05-07T20:33:04.9181390Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:04.9181682Z 2025-05-07T20:33:04.9181762Z @given( 2025-05-07T20:33:04.9182004Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9182333Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9182647Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9182998Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9183345Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9183642Z ) 2025-05-07T20:33:04.9184014Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9184475Z def test_silu_mul_quant( 2025-05-07T20:33:04.9184723Z self, 2025-05-07T20:33:04.9184926Z T: int, 2025-05-07T20:33:04.9185131Z D: int, 2025-05-07T20:33:04.9185357Z scale_ub: Optional[float], 2025-05-07T20:33:04.9185645Z contiguous: bool, 2025-05-07T20:33:04.9185897Z compiled: bool, 2025-05-07T20:33:04.9186124Z ) -> None: 2025-05-07T20:33:04.9186353Z torch.manual_seed(2025) 2025-05-07T20:33:04.9186608Z 2025-05-07T20:33:04.9186891Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9189075Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
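[annotation] This CompilationError is an architecture mismatch rather than a kernel bug: Triton only lowers the fp8e4nv (e4m3) dtype on newer NVIDIA parts, and the A10G in a g5 instance reports compute capability (8, 6). A common pattern is to skip fp8 tests below a threshold; the sketch below assumes sm_89 (Ada/Hopper) as the cutoff, which should be verified against the Triton version in use:

    # Sketch: gate fp8 (e4m3) tests on GPU compute capability.
    import unittest

    import torch

    def supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv only on compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs compute capability >= 8.9")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this, sm_86 runners would report skips instead of Hypothesis re-deriving the same CompilationError example after example.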
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:04.9191042Z 2025-05-07T20:33:04.9191165Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:04.9191392Z 2025-05-07T20:33:04.9191498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:04.9191935Z self=, 2025-05-07T20:33:04.9192348Z T=128, 2025-05-07T20:33:04.9192544Z D=7168, 2025-05-07T20:33:04.9192745Z scale_ub=1200.0, 2025-05-07T20:33:04.9192978Z contiguous=True, 2025-05-07T20:33:04.9193209Z compiled=True, 2025-05-07T20:33:04.9193425Z ) 2025-05-07T20:33:04.9193756Z self = 2025-05-07T20:33:04.9194362Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:04.9194649Z 2025-05-07T20:33:04.9194767Z @given( 2025-05-07T20:33:04.9195009Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:04.9195329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:04.9195649Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:04.9195991Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:04.9196329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:04.9196627Z ) 2025-05-07T20:33:04.9196991Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:04.9197451Z def test_silu_mul_quant( 2025-05-07T20:33:04.9197694Z self, 2025-05-07T20:33:04.9197896Z T: int, 2025-05-07T20:33:04.9198101Z D: int, 2025-05-07T20:33:04.9198325Z scale_ub: Optional[float], 2025-05-07T20:33:04.9198608Z contiguous: bool, 2025-05-07T20:33:04.9198861Z compiled: bool, 2025-05-07T20:33:04.9199087Z ) -> None: 2025-05-07T20:33:04.9199318Z torch.manual_seed(2025) 2025-05-07T20:33:04.9199575Z 2025-05-07T20:33:04.9199850Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:04.9200292Z 2025-05-07T20:33:04.9200496Z x_sign = torch.sign(x) 2025-05-07T20:33:04.9200794Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:04.9201116Z x = x_sign * x_clamp 2025-05-07T20:33:04.9201367Z x0 = x[:, :D] 2025-05-07T20:33:04.9201588Z x1 = x[:, D:] 2025-05-07T20:33:04.9201808Z 2025-05-07T20:33:04.9202000Z if contiguous: 2025-05-07T20:33:04.9202238Z x0 = x0.contiguous() 2025-05-07T20:33:04.9202508Z x1 = x1.contiguous() 2025-05-07T20:33:04.9202759Z 2025-05-07T20:33:04.9202959Z if scale_ub is not None: 2025-05-07T20:33:04.9203244Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:04.9203599Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:04.9203924Z ) 2025-05-07T20:33:04.9204122Z else: 2025-05-07T20:33:04.9204344Z scale_ub_tensor = None 2025-05-07T20:33:04.9204608Z 2025-05-07T20:33:04.9204844Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:04.9205176Z op = silu_mul_quant 2025-05-07T20:33:04.9205436Z if compiled: 2025-05-07T20:33:04.9205688Z op = torch.compile(op) 2025-05-07T20:33:04.9205999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9206290Z 2025-05-07T20:33:04.9206486Z > y_fp8, y_scale = fn() 2025-05-07T20:33:04.9206663Z 2025-05-07T20:33:04.9206765Z moe/activation_test.py:117: 2025-05-07T20:33:04.9207070Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9207479Z moe/activation_test.py:115: in fn 2025-05-07T20:33:04.9207768Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:04.9208391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:04.9208977Z return fn(*args, **kwargs) 
2025-05-07T20:33:04.9209657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:04.9210368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:04.9210931Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:04.9211643Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:04.9212332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:04.9212887Z kernel = self.compile( 2025-05-07T20:33:04.9213758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:04.9214537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:04.9215011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:04.9215260Z 2025-05-07T20:33:04.9215476Z self = 2025-05-07T20:33:04.9216601Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:04.9218034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf9f7b00>} 2025-05-07T20:33:04.9219431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:04.9220504Z context = 2025-05-07T20:33:04.9220817Z 2025-05-07T20:33:04.9220994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:04.9221543Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:04.9222031Z module_map=module_map) 2025-05-07T20:33:04.9222415Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:04.9222793Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:04.9223063Z E ^ 2025-05-07T20:33:04.9223550Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:04.9224053Z 2025-05-07T20:33:04.9224512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.1986812Z 2025-05-07T20:33:05.1987413Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.1988076Z self=, 2025-05-07T20:33:05.1988656Z T=128, 2025-05-07T20:33:05.1988913Z D=7168, 2025-05-07T20:33:05.1989169Z scale_ub=1200.0, 2025-05-07T20:33:05.1989463Z contiguous=True, 2025-05-07T20:33:05.1989761Z compiled=False, 2025-05-07T20:33:05.1989999Z ) 2025-05-07T20:33:05.1990348Z self = 2025-05-07T20:33:05.1990858Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.1991144Z 2025-05-07T20:33:05.1991225Z @given( 2025-05-07T20:33:05.1991465Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.1991781Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.1992102Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.1992714Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.1993049Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.1993434Z ) 2025-05-07T20:33:05.1993802Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.1994260Z def test_silu_mul_quant( 2025-05-07T20:33:05.1994503Z self, 2025-05-07T20:33:05.1994702Z T: int, 2025-05-07T20:33:05.1994902Z D: int, 2025-05-07T20:33:05.1995120Z scale_ub: Optional[float], 2025-05-07T20:33:05.1995401Z contiguous: bool, 2025-05-07T20:33:05.1995651Z compiled: bool, 2025-05-07T20:33:05.1995879Z ) -> None: 2025-05-07T20:33:05.1996104Z torch.manual_seed(2025) 2025-05-07T20:33:05.1996352Z 2025-05-07T20:33:05.1996627Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.1996979Z 2025-05-07T20:33:05.1997179Z x_sign = torch.sign(x) 2025-05-07T20:33:05.1997481Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.1999768Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
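[annotation] Note where the next failures trip: not the initial randn but the torch.clamp two lines later. torch.sign, torch.abs, and torch.clamp each materialize another [T, 2*D] temporary on top of x, so peak usage is a small multiple of the input. A hedged rewrite with in-place ops, equivalent here because the freshly created x has no other aliases:

    # Sketch: the same sign-preserving clamp with fewer full-size temporaries.
    import torch

    T, D = 128, 7168  # one of the failing parametrizations
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)       # the one remaining extra buffer
    x.abs_().clamp_(0.01, 2.0)   # in place: no abs/clamp temporaries
    x.mul_(x_sign)

This trims the preprocessing peak; it does nothing about memory already held when the example starts.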
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2001818Z 2025-05-07T20:33:05.2001943Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.2002169Z 2025-05-07T20:33:05.2002275Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2002701Z self=, 2025-05-07T20:33:05.2003114Z T=128, 2025-05-07T20:33:05.2003307Z D=5120, 2025-05-07T20:33:05.2003521Z scale_ub=1200.0, 2025-05-07T20:33:05.2003754Z contiguous=True, 2025-05-07T20:33:05.2003983Z compiled=True, 2025-05-07T20:33:05.2004198Z ) 2025-05-07T20:33:05.2004532Z self = 2025-05-07T20:33:05.2005039Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.2005314Z 2025-05-07T20:33:05.2005392Z @given( 2025-05-07T20:33:05.2005628Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2005950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2006261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2006606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2006949Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2007241Z ) 2025-05-07T20:33:05.2007607Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2008067Z def test_silu_mul_quant( 2025-05-07T20:33:05.2008315Z self, 2025-05-07T20:33:05.2021798Z T: int, 2025-05-07T20:33:05.2022187Z D: int, 2025-05-07T20:33:05.2022619Z scale_ub: Optional[float], 2025-05-07T20:33:05.2023114Z contiguous: bool, 2025-05-07T20:33:05.2023473Z compiled: bool, 2025-05-07T20:33:05.2023822Z ) -> None: 2025-05-07T20:33:05.2024150Z torch.manual_seed(2025) 2025-05-07T20:33:05.2024518Z 2025-05-07T20:33:05.2024940Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2025461Z 2025-05-07T20:33:05.2025767Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2026218Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2029445Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
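[annotation] For orientation, the operation under test is "SiLU then multiply": silu(x0) * x1 with silu(z) = z * sigmoid(z), computed on the two halves of x. That is exactly what the test's ref_fn evaluates in fp32 before quantizing; a standalone sketch:

    # Sketch: the unquantized SiLU-mul that silu_mul_quant fuses with fp8 output.
    import torch

    def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        x0_fp32 = x0.to(torch.float32)
        x1_fp32 = x1.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32  # silu(x0) * x1

The fused kernel's value is doing this and the row-wise fp8 quantization in one pass instead of materializing y in fp32.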
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2032289Z 2025-05-07T20:33:05.2032473Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.2032790Z 2025-05-07T20:33:05.2032961Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2033571Z self=, 2025-05-07T20:33:05.2034199Z T=128, 2025-05-07T20:33:05.2034505Z D=7168, 2025-05-07T20:33:05.2034788Z scale_ub=None, 2025-05-07T20:33:05.2035113Z contiguous=True, 2025-05-07T20:33:05.2035448Z compiled=True, 2025-05-07T20:33:05.2035742Z ) 2025-05-07T20:33:05.2036218Z self = 2025-05-07T20:33:05.2036938Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2037426Z 2025-05-07T20:33:05.2037554Z @given( 2025-05-07T20:33:05.2037976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2038452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2038915Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2039398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2039893Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2040426Z ) 2025-05-07T20:33:05.2040934Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2041477Z def test_silu_mul_quant( 2025-05-07T20:33:05.2041803Z self, 2025-05-07T20:33:05.2042024Z T: int, 2025-05-07T20:33:05.2042263Z D: int, 2025-05-07T20:33:05.2042570Z scale_ub: Optional[float], 2025-05-07T20:33:05.2042889Z contiguous: bool, 2025-05-07T20:33:05.2043198Z compiled: bool, 2025-05-07T20:33:05.2043522Z ) -> None: 2025-05-07T20:33:05.2048285Z torch.manual_seed(2025) 2025-05-07T20:33:05.2048550Z 2025-05-07T20:33:05.2048849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2050983Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
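[annotation] The reference path's triton_quantize_fp8_row is per-row dynamic quantization: each row is scaled so its max magnitude maps to the fp8 format's largest value (optionally capped by scale_ub), and the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None]. A pure-PyTorch sketch of that scheme; quantize_fp8_row_ref is illustrative, not FBGEMM's implementation:

    # Sketch: per-row e4m3 quantization consistent with multiply-to-dequantize.
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:                  # scale_ub: 1-element fp32 tensor
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = row_max / FP8_MAX
        scale = torch.where(scale == 0, torch.ones_like(scale), scale)  # guard 0 rows
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

The e4m3 requirement is also why this path dies on sm_86: fp8e4nv in the error message is Triton's name for this format.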
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2052892Z 2025-05-07T20:33:05.2053019Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.2053247Z 2025-05-07T20:33:05.2053846Z FAILED 2025-05-07T20:33:05.2053969Z 2025-05-07T20:33:05.2054114Z =================================== FAILURES =================================== 2025-05-07T20:33:05.2054568Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:05.2055038Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:05.2055681Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:05.2056244Z | yield 2025-05-07T20:33:05.2056705Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:05.2057326Z | self._callTestMethod(testMethod) 2025-05-07T20:33:05.2057639Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:05.2058234Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:05.2058917Z | if method() is not None: 2025-05-07T20:33:05.2059189Z | ~~~~~~^^ 2025-05-07T20:33:05.2059906Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:05.2060671Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2060992Z | ^^^^^^^ 2025-05-07T20:33:05.2061595Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:05.2062386Z | raise the_error_hypothesis_found 2025-05-07T20:33:05.2062975Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:05.2063517Z +-+---------------- 1 ---------------- 2025-05-07T20:33:05.2063879Z | Traceback (most recent call last): 2025-05-07T20:33:05.2064970Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2066203Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2069257Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
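[annotation] The "+ Exception Group Traceback" framing comes from Hypothesis reporting all distinct failures at once as a PEP 654 ExceptionGroup. On Python 3.11+ (this job runs 3.13) such a group can be unpacked with except*; a sketch, where run_suite is a hypothetical entry point that lets the group propagate:

    # Sketch: splitting Hypothesis's ExceptionGroup into OOM vs. other failures.
    import torch

    def run_suite() -> None:
        ...  # hypothetical: invokes the test and re-raises its ExceptionGroup

    try:
        run_suite()
    except* torch.OutOfMemoryError as oom_group:
        for exc in oom_group.exceptions:
            print("OOM sub-failure:", exc)
    except* Exception as other_group:
        for exc in other_group.exceptions:
            print("other sub-failure:", type(exc).__name__)

Applied to this run, the split would surface the two root causes: CUDA OOM on the larger shapes, and the fp8e4nv CompilationError independent of shape.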
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2072123Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2072759Z | self=, 2025-05-07T20:33:05.2073352Z | T=2048, 2025-05-07T20:33:05.2073689Z | D=5120, # or any other generated value 2025-05-07T20:33:05.2074169Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:05.2074700Z | contiguous=True, # or any other generated value 2025-05-07T20:33:05.2075236Z | compiled=False, # or any other generated value 2025-05-07T20:33:05.2075684Z | ) 2025-05-07T20:33:05.2075935Z | 2025-05-07T20:33:05.2076699Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:05.2077583Z +---------------- 2 ---------------- 2025-05-07T20:33:05.2078001Z | Traceback (most recent call last): 2025-05-07T20:33:05.2079031Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2080279Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2083283Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2085984Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2086450Z | self=, 2025-05-07T20:33:05.2086888Z | T=128, 2025-05-07T20:33:05.2087107Z | D=7168, 2025-05-07T20:33:05.2087325Z | scale_ub=None, 2025-05-07T20:33:05.2087581Z | contiguous=True, 2025-05-07T20:33:05.2087906Z | compiled=True, 2025-05-07T20:33:05.2088143Z | ) 2025-05-07T20:33:05.2088328Z | 2025-05-07T20:33:05.2088925Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2089562Z +---------------- 3 ---------------- 2025-05-07T20:33:05.2089873Z | Traceback (most recent call last): 2025-05-07T20:33:05.2090613Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:05.2091426Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2093588Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
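[annotation] Each sub-failure ends with a ready-made repro line: @reproduce_failure is a documented Hypothesis decorator that replays exactly one stored example. The blob encodes the choice sequence, the version pin ('6.131.14') guards against encoding changes, and the blob only decodes against the same strategies in the same order, so the decorator stacks on top of the test's existing @given. A sketch using the blob from sub-failure 2 above:

    # Sketch: temporarily pin the test to one falsifying example.
    from hypothesis import given, reproduce_failure, strategies as st

    @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=')  # copied verbatim from the log
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_silu_mul_quant(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body as in the suite; the decorator is meant to be temporary

Hypothesis raises DidNotReproduce if the pinned example stops failing, which makes the decorator safe to carry only on a debugging branch.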
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.2095721Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2096185Z | self=, 2025-05-07T20:33:05.2096604Z | T=128, 2025-05-07T20:33:05.2096819Z | D=5120, 2025-05-07T20:33:05.2097047Z | scale_ub=1200.0, 2025-05-07T20:33:05.2097296Z | contiguous=True, 2025-05-07T20:33:05.2097551Z | compiled=True, 2025-05-07T20:33:05.2097788Z | ) 2025-05-07T20:33:05.2097970Z | 2025-05-07T20:33:05.2098522Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2099148Z +---------------- 4 ---------------- 2025-05-07T20:33:05.2099459Z | Traceback (most recent call last): 2025-05-07T20:33:05.2100208Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:05.2100956Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2101257Z | ~~~~~~^^ 2025-05-07T20:33:05.2101924Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:05.2102648Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2103518Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:05.2104347Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2104645Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:05.2104922Z | a, 2025-05-07T20:33:05.2105133Z | ^^ 2025-05-07T20:33:05.2105365Z | ...<23 lines>... 
2025-05-07T20:33:05.2105626Z | USE_INT64=use_int64, 2025-05-07T20:33:05.2105903Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2106162Z | ) 2025-05-07T20:33:05.2106362Z | ^ 2025-05-07T20:33:05.2106904Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:05.2107672Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2108146Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2108819Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:05.2109631Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2110181Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2110926Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:05.2111647Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2112053Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:05.2112698Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:05.2113283Z | fn() 2025-05-07T20:33:05.2113898Z | ~~^^ 2025-05-07T20:33:05.2114566Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:05.2115479Z | self.fn.run( 2025-05-07T20:33:05.2115795Z | ~~~~~~~~~~~^ 2025-05-07T20:33:05.2116113Z | *args, 2025-05-07T20:33:05.2116418Z | ^^^^^^ 2025-05-07T20:33:05.2116930Z | **current, 2025-05-07T20:33:05.2117250Z | ^^^^^^^^^^ 2025-05-07T20:33:05.2117569Z | ) 2025-05-07T20:33:05.2117923Z | ^ 2025-05-07T20:33:05.2118654Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:05.2119504Z | kernel = self.compile( 2025-05-07T20:33:05.2119880Z | src, 2025-05-07T20:33:05.2120278Z | target=target, 2025-05-07T20:33:05.2120665Z | options=options.__dict__, 2025-05-07T20:33:05.2121065Z | ) 2025-05-07T20:33:05.2121850Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:05.2122886Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2123937Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:05.2125106Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2125791Z | module_map=module_map) 2025-05-07T20:33:05.2126318Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2126824Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2127193Z | ^ 2025-05-07T20:33:05.2165832Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2166708Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:05.2167281Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:05.2167998Z | self=, 2025-05-07T20:33:05.2168610Z | T=1, # or any other generated value 2025-05-07T20:33:05.2169039Z | D=5120, # or any other generated value 2025-05-07T20:33:05.2169496Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:05.2169993Z | contiguous=True, # or any other generated value 2025-05-07T20:33:05.2170487Z | compiled=True, # or any other generated value 2025-05-07T20:33:05.2170895Z | ) 2025-05-07T20:33:05.2171146Z | 2025-05-07T20:33:05.2171897Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:05.2172766Z +------------------------------------ 2025-05-07T20:33:05.2173282Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:05.2173820Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2174409Z self=, 2025-05-07T20:33:05.2174964Z T=1, 2025-05-07T20:33:05.2175422Z D=5120, 2025-05-07T20:33:05.2175692Z scale_ub=None, 2025-05-07T20:33:05.2175982Z contiguous=True, 2025-05-07T20:33:05.2176287Z compiled=True, 2025-05-07T20:33:05.2176566Z ) 2025-05-07T20:33:05.2177085Z self = 2025-05-07T20:33:05.2177740Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2178099Z 2025-05-07T20:33:05.2178210Z @given( 2025-05-07T20:33:05.2178519Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2178934Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2179351Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2179796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2180236Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2180637Z ) 2025-05-07T20:33:05.2181124Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2181743Z def test_silu_mul_quant( 2025-05-07T20:33:05.2182083Z self, 2025-05-07T20:33:05.2182422Z T: int, 2025-05-07T20:33:05.2182695Z D: int, 2025-05-07T20:33:05.2183005Z scale_ub: Optional[float], 2025-05-07T20:33:05.2183448Z contiguous: bool, 2025-05-07T20:33:05.2183795Z compiled: bool, 2025-05-07T20:33:05.2184129Z ) -> None: 2025-05-07T20:33:05.2184460Z torch.manual_seed(2025) 2025-05-07T20:33:05.2184815Z 2025-05-07T20:33:05.2185188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2185676Z 2025-05-07T20:33:05.2185951Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2186358Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2186787Z x = x_sign * x_clamp 2025-05-07T20:33:05.2187120Z x0 = x[:, :D] 2025-05-07T20:33:05.2187411Z x1 = x[:, D:] 2025-05-07T20:33:05.2187696Z 2025-05-07T20:33:05.2187948Z if contiguous: 2025-05-07T20:33:05.2188257Z x0 = x0.contiguous() 2025-05-07T20:33:05.2188607Z x1 = x1.contiguous() 2025-05-07T20:33:05.2188940Z 2025-05-07T20:33:05.2189196Z if scale_ub is not None: 2025-05-07T20:33:05.2189580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2190036Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2190476Z ) 2025-05-07T20:33:05.2190751Z else: 2025-05-07T20:33:05.2191059Z scale_ub_tensor = None 2025-05-07T20:33:05.2191423Z 2025-05-07T20:33:05.2191740Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2192192Z op = silu_mul_quant 2025-05-07T20:33:05.2192539Z if compiled: 2025-05-07T20:33:05.2192879Z op = torch.compile(op) 2025-05-07T20:33:05.2193294Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2193671Z 2025-05-07T20:33:05.2193943Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:05.2194398Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2194806Z 2025-05-07T20:33:05.2195129Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2195607Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2196039Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2196491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2197019Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2197472Z 2025-05-07T20:33:05.2197756Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2198023Z 2025-05-07T20:33:05.2198162Z moe/activation_test.py:126: 2025-05-07T20:33:05.2198582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2199041Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2199484Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2200726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2201884Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2202752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2203734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2204731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2205756Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2206819Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2207734Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2208571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2209292Z fn() 2025-05-07T20:33:05.2210042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2210909Z self.fn.run( 2025-05-07T20:33:05.2211595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2212369Z kernel = self.compile( 2025-05-07T20:33:05.2213157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2214398Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2214971Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2215290Z 2025-05-07T20:33:05.2215570Z self = 2025-05-07T20:33:05.2217057Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2218971Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7ff7a7cae700>} 2025-05-07T20:33:05.2220821Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2222213Z context = 2025-05-07T20:33:05.2222600Z 2025-05-07T20:33:05.2222827Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2223546Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2224205Z module_map=module_map) 2025-05-07T20:33:05.2224720Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2225218Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2225596Z E ^ 2025-05-07T20:33:05.2226268Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2226943Z 2025-05-07T20:33:05.2227558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2228314Z 2025-05-07T20:33:05.2228458Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2229050Z self=, 2025-05-07T20:33:05.2229630Z T=2048, 2025-05-07T20:33:05.2229884Z D=5120, 2025-05-07T20:33:05.2230148Z scale_ub=1200.0, 2025-05-07T20:33:05.2230450Z contiguous=True, 2025-05-07T20:33:05.2230862Z compiled=False, 2025-05-07T20:33:05.2231145Z ) 2025-05-07T20:33:05.2231585Z self = 2025-05-07T20:33:05.2232335Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.2232723Z 2025-05-07T20:33:05.2232829Z @given( 2025-05-07T20:33:05.2233147Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2233569Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2233992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2234474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2234964Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2235379Z ) 2025-05-07T20:33:05.2235895Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2236540Z def test_silu_mul_quant( 2025-05-07T20:33:05.2236888Z self, 2025-05-07T20:33:05.2237176Z T: int, 2025-05-07T20:33:05.2237467Z D: int, 2025-05-07T20:33:05.2237779Z scale_ub: Optional[float], 2025-05-07T20:33:05.2238252Z contiguous: bool, 2025-05-07T20:33:05.2238589Z compiled: bool, 2025-05-07T20:33:05.2238892Z ) -> None: 2025-05-07T20:33:05.2239263Z torch.manual_seed(2025) 2025-05-07T20:33:05.2239601Z 2025-05-07T20:33:05.2239962Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2240556Z 2025-05-07T20:33:05.2240823Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2241219Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2241633Z x = x_sign * x_clamp 2025-05-07T20:33:05.2241966Z x0 = x[:, :D] 2025-05-07T20:33:05.2242271Z x1 = x[:, D:] 2025-05-07T20:33:05.2242548Z 2025-05-07T20:33:05.2242801Z if contiguous: 2025-05-07T20:33:05.2243116Z x0 = x0.contiguous() 2025-05-07T20:33:05.2243477Z x1 = x1.contiguous() 2025-05-07T20:33:05.2243808Z 2025-05-07T20:33:05.2244083Z if scale_ub is not None: 2025-05-07T20:33:05.2244490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2245387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2245812Z ) 2025-05-07T20:33:05.2246085Z else: 2025-05-07T20:33:05.2246396Z scale_ub_tensor = None 2025-05-07T20:33:05.2246750Z 2025-05-07T20:33:05.2247066Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2247499Z op = silu_mul_quant 2025-05-07T20:33:05.2268875Z if compiled: 
2025-05-07T20:33:05.2269227Z op = torch.compile(op) 2025-05-07T20:33:05.2269633Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2269983Z 2025-05-07T20:33:05.2270227Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2270453Z 2025-05-07T20:33:05.2270579Z moe/activation_test.py:117: 2025-05-07T20:33:05.2270950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2271376Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2271728Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2272673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2273681Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2274520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2275504Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2276406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2277146Z kernel = self.compile( 2025-05-07T20:33:05.2277932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2279014Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2279578Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2279919Z 2025-05-07T20:33:05.2280406Z self = 2025-05-07T20:33:05.2281954Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2283965Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7a7b62020>} 2025-05-07T20:33:05.2285862Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2286938Z context = 2025-05-07T20:33:05.2287307Z 2025-05-07T20:33:05.2287485Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2288076Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2288557Z module_map=module_map) 2025-05-07T20:33:05.2288935Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2289302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2289564Z E ^ 2025-05-07T20:33:05.2290047Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2290511Z 2025-05-07T20:33:05.2290946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2291475Z 2025-05-07T20:33:05.2291588Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2292003Z self=, 2025-05-07T20:33:05.2292413Z T=2048, 2025-05-07T20:33:05.2292604Z D=5120, 2025-05-07T20:33:05.2292795Z scale_ub=1200.0, 2025-05-07T20:33:05.2293016Z contiguous=True, 2025-05-07T20:33:05.2293235Z compiled=True, 2025-05-07T20:33:05.2293434Z ) 2025-05-07T20:33:05.2293758Z self = 2025-05-07T20:33:05.2294323Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.2294598Z 2025-05-07T20:33:05.2294683Z @given( 2025-05-07T20:33:05.2294908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2295228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2295543Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2295875Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2296216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2296508Z ) 2025-05-07T20:33:05.2296861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2297322Z def test_silu_mul_quant( 2025-05-07T20:33:05.2297576Z self, 2025-05-07T20:33:05.2297767Z T: int, 2025-05-07T20:33:05.2297970Z D: int, 2025-05-07T20:33:05.2298193Z scale_ub: Optional[float], 2025-05-07T20:33:05.2298467Z contiguous: bool, 2025-05-07T20:33:05.2298711Z compiled: bool, 2025-05-07T20:33:05.2298937Z ) -> None: 2025-05-07T20:33:05.2299159Z torch.manual_seed(2025) 2025-05-07T20:33:05.2299398Z 2025-05-07T20:33:05.2299677Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2300028Z 2025-05-07T20:33:05.2300219Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2300520Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2300842Z x = x_sign * x_clamp 2025-05-07T20:33:05.2301136Z x0 = x[:, :D] 2025-05-07T20:33:05.2301358Z x1 = x[:, D:] 2025-05-07T20:33:05.2301573Z 2025-05-07T20:33:05.2301757Z if contiguous: 2025-05-07T20:33:05.2302039Z x0 = x0.contiguous() 2025-05-07T20:33:05.2302313Z x1 = x1.contiguous() 2025-05-07T20:33:05.2302557Z 2025-05-07T20:33:05.2302756Z if scale_ub is not None: 2025-05-07T20:33:05.2303040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2303383Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2303705Z ) 2025-05-07T20:33:05.2303902Z else: 2025-05-07T20:33:05.2304106Z scale_ub_tensor = None 2025-05-07T20:33:05.2304389Z 2025-05-07T20:33:05.2304651Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2304978Z op = silu_mul_quant 2025-05-07T20:33:05.2305234Z if compiled: 2025-05-07T20:33:05.2305489Z op = torch.compile(op) 2025-05-07T20:33:05.2305798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2306126Z 2025-05-07T20:33:05.2306324Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2306624Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2306958Z 2025-05-07T20:33:05.2307204Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2307552Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2307846Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2308172Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2308544Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2308864Z 2025-05-07T20:33:05.2309061Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7ff7a6c44400>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
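Every example in this block dies on the same Triton error: both kernels request the fp8e4nv (FP8 E4M3) element type, and the compiler reports that only 'fp8e4b15' and 'fp8e5' are available on this GPU's architecture. A minimal capability guard is sketched below; it assumes the usual mapping of Triton's fp8e4nv to NVIDIA compute capability 8.9 or newer, and the helper name is illustrative rather than part of the test file.

    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton lowers fp8e4nv (e4m3) only on NVIDIA GPUs with
        # compute capability >= 8.9; older parts raise the ValueError that
        # repeats throughout this log.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

Wrapped in something like unittest.skipUnless(cuda_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU"), the examples below would be reported as skips rather than hard errors on this hardware.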
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

[test body identical to the listing above, through the definition of fn()]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
[then the same autotuner/compile frames as in the first traceback above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Note the pattern: with compiled=False the test fails inside fn() at _fbgemm_silu_mul_quant, while with compiled=True the call under test returns and the failure instead surfaces in the eager reference path, inside triton_quantize_fp8_row's _kernel_quantize_fp8_row. The root error is identical in both paths.
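The reference path makes the quantization contract explicit: ref_fn computes the SiLU product y = x0 * sigmoid(x0) * x1 in fp32 and then row-quantizes it so that y_fp8.to(torch.float32) * y_scale[:, None] approximately reconstructs y. A rough pure-PyTorch equivalent of that contract is sketched below, assuming an e4m3 target and scale_ub acting as a clamp on the per-row maximum; FBGEMM's Triton kernel remains the authority on the details.

    from typing import Optional, Tuple
    import torch

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row scale chosen so dequant is y_fp8.float() * y_scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
        row_max = y.abs().amax(dim=-1, keepdim=True).to(torch.float32)
        if scale_ub is not None:
            # Assumption: scale_ub caps the per-row max before scaling.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.device))
        y_scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.to(torch.float32) / y_scale).clamp(-fp8_max, fp8_max)
        return y_fp8.to(torch.float8_e4m3fn), y_scale.squeeze(-1)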
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[test body identical to the first listing; fails at fn(), moe/activation_test.py:117, with the same eager traceback through silu_mul_quant]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
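The _fbgemm_silu_mul_quant[grid](...) frame in these tracebacks is Triton's launch syntax: subscripting a @triton.jit function with a grid produces a launcher, and compilation happens lazily at the first call, which is why the architecture check only fires while the test runs rather than at import time. A minimal illustration of the same launch pattern (a toy copy kernel, unrelated to FBGEMM's kernels):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _copy_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        # Each program instance copies one BLOCK-sized slice of x into y.
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)

    x = torch.randn(1024, device="cuda")
    y = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 256),)
    _copy_kernel[grid](x, y, x.numel(), BLOCK=256)  # compiles here, on first launch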
Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
[fn() returns; fails at ref_fn(), moe/activation_test.py:126: same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
[fails at fn(): same CompilationError in _fbgemm_silu_mul_quant]
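The "Trying example:" lines come from Hypothesis's Verbosity.verbose setting in the decorators above: each draw from the sampled_from grids is printed before it runs, which is why the identical test body repeats for every parameter combination. A standalone toy showing the same mechanics (not the FBGEMM test itself):

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # Trivial property; running this prints "Trying example: check_grid(...)"
        # lines analogous to the ones in this log.
        assert T > 0 and D > 0

    check_grid()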
Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=2048,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]

Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
[fn() returns; fails at ref_fn(): same CompilationError in _kernel_quantize_fp8_row]
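Since every parameter combination fails the same way, the simplest repro does not need Hypothesis at all. A minimal eager-mode script is sketched below, with the import location taken from the tracebacks above; it assumes a CUDA build of fbgemm_gpu's experimental GenAI package is installed.

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    x0, x1 = x[:, :D].contiguous(), x[:, D:].contiguous()

    # On a GPU without fp8e4nv support this raises the CompilationError seen
    # above; on a supported GPU it returns the fp8 tensor and per-row scales.
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)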
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2540289Z 2025-05-07T20:33:05.2540715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2540719Z 2025-05-07T20:33:05.2540831Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2541060Z self=, 2025-05-07T20:33:05.2541144Z T=4096, 2025-05-07T20:33:05.2541223Z D=5120, 2025-05-07T20:33:05.2541309Z scale_ub=None, 2025-05-07T20:33:05.2541401Z contiguous=True, 2025-05-07T20:33:05.2541487Z compiled=True, 2025-05-07T20:33:05.2541605Z ) 2025-05-07T20:33:05.2541835Z self = 2025-05-07T20:33:05.2542075Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2542080Z 2025-05-07T20:33:05.2542160Z @given( 2025-05-07T20:33:05.2542286Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2542387Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2542505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2542630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2542747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2542829Z ) 2025-05-07T20:33:05.2543082Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2543179Z def test_silu_mul_quant( 2025-05-07T20:33:05.2543264Z self, 2025-05-07T20:33:05.2543341Z T: int, 2025-05-07T20:33:05.2543422Z D: int, 2025-05-07T20:33:05.2543528Z scale_ub: Optional[float], 2025-05-07T20:33:05.2543621Z contiguous: bool, 2025-05-07T20:33:05.2543712Z compiled: bool, 2025-05-07T20:33:05.2543800Z ) -> None: 2025-05-07T20:33:05.2543897Z torch.manual_seed(2025) 2025-05-07T20:33:05.2543971Z 2025-05-07T20:33:05.2544157Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2544232Z 2025-05-07T20:33:05.2544356Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2544501Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2544598Z x = x_sign * x_clamp 2025-05-07T20:33:05.2544687Z x0 = x[:, :D] 2025-05-07T20:33:05.2544768Z x1 = x[:, D:] 2025-05-07T20:33:05.2544841Z 2025-05-07T20:33:05.2544933Z if contiguous: 2025-05-07T20:33:05.2545025Z x0 = x0.contiguous() 2025-05-07T20:33:05.2545118Z x1 = x1.contiguous() 2025-05-07T20:33:05.2545198Z 2025-05-07T20:33:05.2545294Z if scale_ub is not None: 2025-05-07T20:33:05.2545405Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2545551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2545627Z ) 2025-05-07T20:33:05.2545711Z else: 2025-05-07T20:33:05.2545807Z scale_ub_tensor = None 2025-05-07T20:33:05.2545881Z 2025-05-07T20:33:05.2546022Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2546114Z op = silu_mul_quant 2025-05-07T20:33:05.2546199Z if compiled: 2025-05-07T20:33:05.2546307Z op = torch.compile(op) 2025-05-07T20:33:05.2546415Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2546489Z 2025-05-07T20:33:05.2546591Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2546715Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2546841Z 2025-05-07T20:33:05.2546986Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2547092Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2547248Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2547374Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2547517Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2547597Z 2025-05-07T20:33:05.2547700Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:05.2547704Z 2025-05-07T20:33:05.2547806Z moe/activation_test.py:126: 2025-05-07T20:33:05.2547944Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2548053Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2548191Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2548769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2548877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2549348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2549579Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2549955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2550227Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2550612Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2550790Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2551140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2551221Z fn() 2025-05-07T20:33:05.2551639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2551729Z self.fn.run( 2025-05-07T20:33:05.2552080Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2552181Z kernel = self.compile( 2025-05-07T20:33:05.2552574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2552758Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2552887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2552892Z 2025-05-07T20:33:05.2553103Z self = 2025-05-07T20:33:05.2553910Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2554481Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7815ea660>} 2025-05-07T20:33:05.2555267Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2555463Z context = 2025-05-07T20:33:05.2555467Z 2025-05-07T20:33:05.2555643Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2555915Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2556025Z module_map=module_map) 2025-05-07T20:33:05.2556243Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2556352Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2556470Z E ^ 2025-05-07T20:33:05.2556846Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2556850Z 2025-05-07T20:33:05.2557277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2557282Z 2025-05-07T20:33:05.2557394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2557626Z self=, 2025-05-07T20:33:05.2557707Z T=16384, 2025-05-07T20:33:05.2557790Z D=5120, 2025-05-07T20:33:05.2557875Z scale_ub=None, 2025-05-07T20:33:05.2557961Z contiguous=True, 2025-05-07T20:33:05.2558055Z compiled=True, 2025-05-07T20:33:05.2558133Z ) 2025-05-07T20:33:05.2558357Z self = 2025-05-07T20:33:05.2558584Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.2558592Z 2025-05-07T20:33:05.2558708Z @given( 2025-05-07T20:33:05.2558836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2558942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2559059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2559185Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2559302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2559377Z ) 2025-05-07T20:33:05.2559639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2559734Z def test_silu_mul_quant( 2025-05-07T20:33:05.2559817Z self, 2025-05-07T20:33:05.2559894Z T: int, 2025-05-07T20:33:05.2559971Z D: int, 2025-05-07T20:33:05.2560140Z scale_ub: Optional[float], 2025-05-07T20:33:05.2560232Z contiguous: bool, 2025-05-07T20:33:05.2560322Z compiled: bool, 2025-05-07T20:33:05.2560408Z ) -> None: 2025-05-07T20:33:05.2560520Z torch.manual_seed(2025) 2025-05-07T20:33:05.2560598Z 2025-05-07T20:33:05.2571890Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2571984Z 2025-05-07T20:33:05.2572086Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2572235Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2572330Z x = x_sign * x_clamp 2025-05-07T20:33:05.2572415Z x0 = x[:, :D] 2025-05-07T20:33:05.2572505Z x1 = x[:, D:] 2025-05-07T20:33:05.2572579Z 2025-05-07T20:33:05.2572666Z if contiguous: 2025-05-07T20:33:05.2572769Z x0 = x0.contiguous() 2025-05-07T20:33:05.2572861Z x1 = x1.contiguous() 2025-05-07T20:33:05.2572944Z 2025-05-07T20:33:05.2573039Z if scale_ub is not None: 2025-05-07T20:33:05.2573159Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2573313Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2573395Z ) 2025-05-07T20:33:05.2573474Z else: 2025-05-07T20:33:05.2573585Z scale_ub_tensor = None 2025-05-07T20:33:05.2573661Z 2025-05-07T20:33:05.2573799Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2573904Z op = silu_mul_quant 2025-05-07T20:33:05.2573992Z if compiled: 2025-05-07T20:33:05.2574099Z op = torch.compile(op) 2025-05-07T20:33:05.2574223Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2574297Z 2025-05-07T20:33:05.2574399Z y_fp8, y_scale = fn() 2025-05-07T20:33:05.2574526Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:05.2574600Z 2025-05-07T20:33:05.2574747Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2574942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:05.2575047Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:05.2575185Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:05.2575380Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2575456Z 2025-05-07T20:33:05.2575568Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:05.2575574Z 2025-05-07T20:33:05.2575676Z moe/activation_test.py:126: 2025-05-07T20:33:05.2575820Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2575931Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:05.2576071Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:05.2576663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:05.2576769Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:05.2577150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2577441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2577864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:05.2578142Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:05.2578534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:05.2578710Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:05.2579075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:05.2579156Z fn() 2025-05-07T20:33:05.2579580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:05.2579669Z self.fn.run( 2025-05-07T20:33:05.2580031Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2580140Z kernel = self.compile( 2025-05-07T20:33:05.2580539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2580723Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2580867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2580872Z 2025-05-07T20:33:05.2581086Z self = 2025-05-07T20:33:05.2581902Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2582432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780b25800>} 2025-05-07T20:33:05.2583220Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2583422Z context = 2025-05-07T20:33:05.2583427Z 2025-05-07T20:33:05.2583602Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2583891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2584004Z module_map=module_map) 2025-05-07T20:33:05.2584173Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2584288Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:05.2584416Z E ^ 2025-05-07T20:33:05.2584837Z E ValueError("type fp8e4nv not supported in this architecture. 
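Every example above dies while Triton compiles FBGEMM's row-wise fp8 quantization kernel, not in the test logic itself. For orientation, the math that triton_quantize_fp8_row performs can be sketched in plain PyTorch. This is an illustrative approximation only: the FP8_E4M3_MAX constant and the clamp semantics of scale_ub are assumptions inferred from the dequant check in the test (y = y_fp8.to(torch.float32) * y_scale[:, None]), not taken from FBGEMM's source.

import torch

FP8_E4M3_MAX = 448.0  # assumed max representable magnitude for torch.float8_e4m3fn

def quantize_fp8_row_sketch(y, scale_ub=None):
    # Per-row scale chosen so that y / scale fits the fp8 range; the
    # inverse (dequant) is y_fp8.to(torch.float32) * y_scale[:, None],
    # matching the check in test_silu_mul_quant.
    row_max = y.abs().amax(dim=1).to(torch.float32)
    if scale_ub is not None:
        # assumed semantics: scale_ub caps the per-row max before scaling
        row_max = torch.minimum(row_max, scale_ub.to(torch.float32))
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_E4M3_MAX
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale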
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
self = <...>
T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    (test source as above)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <...>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _kernel_quantize_fp8_row via ref_fn (trace identical to the T=128 example above)
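Note that the failure does not depend on FBGEMM at all: any Triton kernel that casts to tl.float8e4nv should fail the same way on this GPU. A minimal standalone repro sketch (the kernel and variable names here are hypothetical, chosen for illustration):

import torch
import triton
import triton.language as tl

@triton.jit
def _fp8_cast_kernel(x_ptr, y_ptr, N, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(x_ptr + offs, mask=mask)
    # On GPUs without native fp8e4nv support, this cast is what makes
    # compilation raise ValueError("type fp8e4nv not supported in this
    # architecture. ...") as seen throughout this log.
    tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

N = 1024
x = torch.randn(N, device="cuda", dtype=torch.float32)
y = torch.empty(N, device="cuda", dtype=torch.float8_e4m3fn)
_fp8_cast_kernel[(triton.cdiv(N, 1024),)](x, y, N, BLOCK=1024)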
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant; the eager trace is identical except that the torch/_dynamo/eval_frame.py frame is absent (moe/activation_test.py:115 calls activation.py:80 directly)

Trying example: test_silu_mul_quant(self=<...>, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant (trace identical to the T=1, scale_ub=1200.0 compiled example above)
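Since every generated example fails with the same architecture error, the fp8 tests could be skipped up front on unsupported hardware instead of failing example after example. A minimal sketch of such a guard, assuming Triton only lowers tl.float8e4nv on compute capability 8.9 and newer (the helper and class names are hypothetical):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (e4m3) maps to native FP8 only on compute capability >= (8, 9)
    # (Ada / Hopper); on older GPUs Triton raises the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class Fp8GuardExample(unittest.TestCase):
    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    def test_fp8_path_runs_only_on_supported_gpus(self) -> None:
        self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

if __name__ == "__main__":
    unittest.main()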
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2643146Z 2025-05-07T20:33:05.2643574Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2643587Z 2025-05-07T20:33:05.2643692Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2643922Z self=, 2025-05-07T20:33:05.2644006Z T=128, 2025-05-07T20:33:05.2644084Z D=7168, 2025-05-07T20:33:05.2644169Z scale_ub=1200.0, 2025-05-07T20:33:05.2644263Z contiguous=False, 2025-05-07T20:33:05.2644398Z compiled=False, 2025-05-07T20:33:05.2644472Z ) 2025-05-07T20:33:05.2644702Z self = 2025-05-07T20:33:05.2644929Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.2644934Z 2025-05-07T20:33:05.2645019Z @given( 2025-05-07T20:33:05.2645143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2645244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2645368Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2645489Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2645605Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2645687Z ) 2025-05-07T20:33:05.2645941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2646036Z def test_silu_mul_quant( 2025-05-07T20:33:05.2646122Z self, 2025-05-07T20:33:05.2646203Z T: int, 2025-05-07T20:33:05.2646282Z D: int, 2025-05-07T20:33:05.2646389Z scale_ub: Optional[float], 2025-05-07T20:33:05.2646529Z contiguous: bool, 2025-05-07T20:33:05.2646625Z compiled: bool, 2025-05-07T20:33:05.2646743Z ) -> None: 2025-05-07T20:33:05.2646844Z torch.manual_seed(2025) 2025-05-07T20:33:05.2646924Z 2025-05-07T20:33:05.2647100Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2647177Z 2025-05-07T20:33:05.2647278Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2647405Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2647497Z x = x_sign * x_clamp 2025-05-07T20:33:05.2647584Z x0 = x[:, :D] 2025-05-07T20:33:05.2647668Z x1 = x[:, D:] 2025-05-07T20:33:05.2647744Z 2025-05-07T20:33:05.2647836Z if contiguous: 2025-05-07T20:33:05.2647929Z x0 = x0.contiguous() 2025-05-07T20:33:05.2648028Z x1 = x1.contiguous() 2025-05-07T20:33:05.2648104Z 2025-05-07T20:33:05.2648196Z if scale_ub is not None: 2025-05-07T20:33:05.2648314Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2648456Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2648536Z ) 2025-05-07T20:33:05.2648620Z else: 2025-05-07T20:33:05.2648717Z scale_ub_tensor = None 2025-05-07T20:33:05.2648792Z 2025-05-07T20:33:05.2648932Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2649025Z op = silu_mul_quant 2025-05-07T20:33:05.2649113Z if compiled: 2025-05-07T20:33:05.2649226Z op = torch.compile(op) 2025-05-07T20:33:05.2649334Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2649410Z 2025-05-07T20:33:05.2649510Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2649514Z 2025-05-07T20:33:05.2649614Z moe/activation_test.py:117: 2025-05-07T20:33:05.2649752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2649859Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2649965Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2650493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2650595Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2650967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2651206Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2651559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2651661Z kernel = self.compile( 2025-05-07T20:33:05.2652057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2652314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2652453Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2652496Z 2025-05-07T20:33:05.2652711Z self = 2025-05-07T20:33:05.2653525Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2654049Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff780720400>} 2025-05-07T20:33:05.2654830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2655031Z context = 2025-05-07T20:33:05.2655075Z 2025-05-07T20:33:05.2655249Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2655570Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2655684Z module_map=module_map) 2025-05-07T20:33:05.2655849Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2655958Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2656038Z E ^ 2025-05-07T20:33:05.2656413Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2656418Z 2025-05-07T20:33:05.2656846Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2656851Z 2025-05-07T20:33:05.2656960Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2657198Z self=, 2025-05-07T20:33:05.2657281Z T=128, 2025-05-07T20:33:05.2657373Z D=5120, 2025-05-07T20:33:05.2657462Z scale_ub=None, 2025-05-07T20:33:05.2657551Z contiguous=False, 2025-05-07T20:33:05.2657645Z compiled=False, 2025-05-07T20:33:05.2657720Z ) 2025-05-07T20:33:05.2657948Z self = 2025-05-07T20:33:05.2658131Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.2658136Z 2025-05-07T20:33:05.2658216Z @given( 2025-05-07T20:33:05.2658338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2658446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2658563Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2658689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2658807Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2658884Z ) 2025-05-07T20:33:05.2659150Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2659248Z def test_silu_mul_quant( 2025-05-07T20:33:05.2659325Z self, 2025-05-07T20:33:05.2659408Z T: int, 2025-05-07T20:33:05.2659485Z D: int, 2025-05-07T20:33:05.2659587Z scale_ub: Optional[float], 2025-05-07T20:33:05.2659685Z contiguous: bool, 2025-05-07T20:33:05.2659773Z compiled: bool, 2025-05-07T20:33:05.2659853Z ) -> None: 2025-05-07T20:33:05.2659955Z torch.manual_seed(2025) 2025-05-07T20:33:05.2660029Z 2025-05-07T20:33:05.2660207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2660288Z 2025-05-07T20:33:05.2660382Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2660513Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2660652Z x = x_sign * x_clamp 2025-05-07T20:33:05.2660735Z x0 = x[:, :D] 2025-05-07T20:33:05.2660829Z x1 = x[:, D:] 2025-05-07T20:33:05.2660903Z 2025-05-07T20:33:05.2661029Z if contiguous: 2025-05-07T20:33:05.2661133Z x0 = x0.contiguous() 2025-05-07T20:33:05.2661224Z x1 = x1.contiguous() 2025-05-07T20:33:05.2661298Z 2025-05-07T20:33:05.2661397Z if scale_ub is not None: 2025-05-07T20:33:05.2661505Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2661644Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2661727Z ) 2025-05-07T20:33:05.2661805Z else: 2025-05-07T20:33:05.2661907Z scale_ub_tensor = None 2025-05-07T20:33:05.2661980Z 2025-05-07T20:33:05.2662112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2662213Z op = silu_mul_quant 2025-05-07T20:33:05.2662299Z if compiled: 2025-05-07T20:33:05.2662403Z op = torch.compile(op) 2025-05-07T20:33:05.2662516Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2662633Z 2025-05-07T20:33:05.2662726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2662733Z 2025-05-07T20:33:05.2662877Z moe/activation_test.py:117: 2025-05-07T20:33:05.2663011Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2663119Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2663221Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2663736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2663844Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2664218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2664451Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2664818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2664917Z kernel = self.compile( 2025-05-07T20:33:05.2665325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2665507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2665637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2665641Z 2025-05-07T20:33:05.2665857Z self = 2025-05-07T20:33:05.2666662Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2667192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff781029f80>} 2025-05-07T20:33:05.2667971Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2668170Z context = 2025-05-07T20:33:05.2668181Z 2025-05-07T20:33:05.2668355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2668628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2668745Z module_map=module_map) 2025-05-07T20:33:05.2668912Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2669013Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2669098Z E ^ 2025-05-07T20:33:05.2669511Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2669518Z 2025-05-07T20:33:05.2669997Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2670002Z 2025-05-07T20:33:05.2670111Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2670341Z self=, 2025-05-07T20:33:05.2670427Z T=128, 2025-05-07T20:33:05.2670506Z D=5120, 2025-05-07T20:33:05.2670592Z scale_ub=1200.0, 2025-05-07T20:33:05.2670688Z contiguous=True, 2025-05-07T20:33:05.2670774Z compiled=False, 2025-05-07T20:33:05.2670850Z ) 2025-05-07T20:33:05.2671083Z self = 2025-05-07T20:33:05.2671260Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.2671264Z 2025-05-07T20:33:05.2671353Z @given( 2025-05-07T20:33:05.2671478Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2671630Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2671759Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2671919Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2672038Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2672123Z ) 2025-05-07T20:33:05.2672378Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2672474Z def test_silu_mul_quant( 2025-05-07T20:33:05.2672560Z self, 2025-05-07T20:33:05.2672637Z T: int, 2025-05-07T20:33:05.2672723Z D: int, 2025-05-07T20:33:05.2672824Z scale_ub: Optional[float], 2025-05-07T20:33:05.2672916Z contiguous: bool, 2025-05-07T20:33:05.2673011Z compiled: bool, 2025-05-07T20:33:05.2673091Z ) -> None: 2025-05-07T20:33:05.2673188Z torch.manual_seed(2025) 2025-05-07T20:33:05.2673269Z 2025-05-07T20:33:05.2673444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2673524Z 2025-05-07T20:33:05.2673628Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2673757Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2673848Z x = x_sign * x_clamp 2025-05-07T20:33:05.2673936Z x0 = x[:, :D] 2025-05-07T20:33:05.2674017Z x1 = x[:, D:] 2025-05-07T20:33:05.2674096Z 2025-05-07T20:33:05.2674182Z if contiguous: 2025-05-07T20:33:05.2674275Z x0 = x0.contiguous() 2025-05-07T20:33:05.2674372Z x1 = x1.contiguous() 2025-05-07T20:33:05.2674446Z 2025-05-07T20:33:05.2674538Z if scale_ub is not None: 2025-05-07T20:33:05.2674651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2674791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2674867Z ) 2025-05-07T20:33:05.2674954Z else: 2025-05-07T20:33:05.2675053Z scale_ub_tensor = None 2025-05-07T20:33:05.2675127Z 2025-05-07T20:33:05.2675274Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2675370Z op = silu_mul_quant 2025-05-07T20:33:05.2675461Z if compiled: 2025-05-07T20:33:05.2675569Z op = torch.compile(op) 2025-05-07T20:33:05.2675676Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2675757Z 2025-05-07T20:33:05.2675848Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2675853Z 2025-05-07T20:33:05.2675951Z moe/activation_test.py:117: 2025-05-07T20:33:05.2676088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2676191Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2676292Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2676817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2676968Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <...>, 'min_dot_size': <... at 0x7ff780518c20>}
module_map = {'triton.language.extra.libdevice': <...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Hypothesis re-prints the identical test source and an identical traceback for each example that follows; the duplicates below are collapsed to the parameters tried and the point of failure.]
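Every failure in this run bottoms out in the same ValueError from Triton's IR builder: fp8e4nv is Triton's name for the FP8 E4M3 format (PyTorch's torch.float8_e4m3fn), and Triton only accepts it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older silicon, for example the A10G (sm_86) in g5-class runners, only fp8e4b15 and fp8e5 are available, which is exactly what the message reports. A probe along these lines, a sketch rather than anything from the test suite, shows whether a given machine can compile such kernels:

    # Probe sketch (not from the test suite): check whether this GPU can
    # compile Triton kernels that use fp8e4nv (torch.float8_e4m3fn).
    import torch

    def fp8_e4m3_supported() -> bool:
        # Triton accepts fp8e4nv only on NVIDIA compute capability >= (8, 9).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    if __name__ == "__main__":
        print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability())
        print("fp8e4nv supported:", fp8_e4m3_supported())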
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
  -> on this re-run fn() returns, and the failure moves into the test's reference path:

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
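Notably, the reference path is not a Triton-free fallback: triton_quantize_fp8_row launches its own FP8 Triton kernel (_kernel_quantize_fp8_row), so both sides of the comparison need fp8e4nv support. The test only relies on the rowwise dequantization identity y ~= y_fp8.float() * y_scale[:, None]; a plain-PyTorch sketch of that contract (an illustration assuming these semantics, not FBGEMM's actual triton_quantize_fp8_row) could look like:

    # Rowwise FP8 quantization sketch; semantics inferred from the test's
    # dequantization step. Details (eps, scale_ub handling) are assumptions.
    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=-1).float()  # per-row max magnitude
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale
        y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale

The final cast goes through PyTorch's own conversion kernels rather than Triton, so a sketch like this should run even on GPUs where the kernels above fail to compile.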
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant
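With verbosity=Verbosity.verbose and no hardware gate, Hypothesis keeps generating examples and every one of them hits the same compile error. One conventional way to fail fast on unsupported GPUs is a capability-based skip; the snippet below is a sketch reusing the probe above, and the class name is hypothetical (the real class in moe/activation_test.py is not visible in this log):

    # Skip sketch: gate FP8 tests on compute capability so unsupported
    # runners skip instead of exhausting Hypothesis examples.
    import unittest
    import torch

    def fp8_e4m3_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(fp8_e4m3_supported(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class SiluMulQuantTests(unittest.TestCase):  # hypothetical class name
        ...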
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
  -> same CompilationError in _fbgemm_silu_mul_quant

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:05.2853535Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2853910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2854162Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2854558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2854654Z kernel = self.compile( 2025-05-07T20:33:05.2855057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2855237Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2855367Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2855371Z 2025-05-07T20:33:05.2855589Z self = 2025-05-07T20:33:05.2856390Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2857009Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7806796c0>} 2025-05-07T20:33:05.2857784Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2857984Z context = 2025-05-07T20:33:05.2857988Z 2025-05-07T20:33:05.2858165Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2858441Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.2858561Z module_map=module_map) 2025-05-07T20:33:05.2858725Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.2858869Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.2858957Z E ^ 2025-05-07T20:33:05.2859360Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.2859365Z 2025-05-07T20:33:05.2859802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.2859807Z 2025-05-07T20:33:05.2859915Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.2860145Z self=, 2025-05-07T20:33:05.2860232Z T=4096, 2025-05-07T20:33:05.2860310Z D=5120, 2025-05-07T20:33:05.2860397Z scale_ub=1200.0, 2025-05-07T20:33:05.2860493Z contiguous=False, 2025-05-07T20:33:05.2860581Z compiled=True, 2025-05-07T20:33:05.2860658Z ) 2025-05-07T20:33:05.2860889Z self = 2025-05-07T20:33:05.2861074Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.2861078Z 2025-05-07T20:33:05.2861165Z @given( 2025-05-07T20:33:05.2861287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.2861388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.2861512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.2861632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.2861747Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.2861830Z ) 2025-05-07T20:33:05.2862083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.2862181Z def test_silu_mul_quant( 2025-05-07T20:33:05.2862267Z self, 2025-05-07T20:33:05.2862344Z T: int, 2025-05-07T20:33:05.2862427Z D: int, 2025-05-07T20:33:05.2862531Z scale_ub: Optional[float], 2025-05-07T20:33:05.2862623Z contiguous: bool, 2025-05-07T20:33:05.2862721Z compiled: bool, 2025-05-07T20:33:05.2862801Z ) -> None: 2025-05-07T20:33:05.2862903Z torch.manual_seed(2025) 2025-05-07T20:33:05.2862985Z 2025-05-07T20:33:05.2863162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.2863238Z 2025-05-07T20:33:05.2863341Z x_sign = torch.sign(x) 2025-05-07T20:33:05.2863470Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.2863564Z x = x_sign * x_clamp 2025-05-07T20:33:05.2863653Z x0 = x[:, :D] 2025-05-07T20:33:05.2863735Z x1 = x[:, D:] 2025-05-07T20:33:05.2863811Z 2025-05-07T20:33:05.2863905Z if contiguous: 2025-05-07T20:33:05.2863999Z x0 = x0.contiguous() 2025-05-07T20:33:05.2864099Z x1 = x1.contiguous() 2025-05-07T20:33:05.2864178Z 2025-05-07T20:33:05.2864324Z if scale_ub is not None: 2025-05-07T20:33:05.2864464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.2864622Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.2864746Z ) 2025-05-07T20:33:05.2864838Z else: 2025-05-07T20:33:05.2864936Z scale_ub_tensor = None 2025-05-07T20:33:05.2865010Z 2025-05-07T20:33:05.2865153Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.2865245Z op = silu_mul_quant 2025-05-07T20:33:05.2865333Z if compiled: 2025-05-07T20:33:05.2865443Z op = torch.compile(op) 2025-05-07T20:33:05.2865550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2865632Z 2025-05-07T20:33:05.2865726Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.2865730Z 2025-05-07T20:33:05.2865834Z moe/activation_test.py:117: 2025-05-07T20:33:05.2865974Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2866079Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.2866181Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.2866614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.2866748Z return fn(*args, **kwargs) 
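Every failure in this transcript is the same compilation error: Triton's fp8e4nv is the e4m3 FP8 format backed by hardware FP8 instructions, which NVIDIA introduced with compute capability 8.9 (Ada) and 9.0 (Hopper). On older GPUs Triton offers only the software variants named in the message, ('fp8e4b15', 'fp8e5'), and raises this ValueError at kernel-compile time. Below is a minimal sketch, not part of FBGEMM or this test suite, of a pre-flight check that predicts whether an fp8e4nv kernel can compile on the current device; the helper name supports_fp8e4nv is an assumption for illustration.

# Minimal sketch (assumption: not FBGEMM code) of a pre-flight check for
# whether Triton can be expected to compile fp8e4nv (e4m3) kernels here.
# Hardware FP8 requires compute capability >= 8.9; older GPUs trigger the
# ValueError captured in this log.
import torch

def supports_fp8e4nv() -> bool:
    if not torch.cuda.is_available():
        return False
    # get_device_capability() returns (major, minor), e.g. (8, 6) for A10G.
    return torch.cuda.get_device_capability() >= (8, 9)

if __name__ == "__main__":
    print("fp8e4nv expected to compile:", supports_fp8e4nv())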
Further examples, all failing identically (a reference sketch of the op under test follows this list):

Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError
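For orientation, the op under test returns an FP8 tensor plus a scale (y_fp8, y_scale in the source listing above). The following is a rough eager-mode sketch inferred from the test, not FBGEMM's actual kernel, whose scale granularity and scale_ub handling may well differ: y = silu(x0) * x1, dynamically quantized to e4m3 FP8 with a per-row scale.

# Rough eager-mode reference (an inference from the test above, not FBGEMM's
# kernel): silu(x0) * x1, then per-row dynamic quantization to e4m3 FP8,
# with the row maximum optionally clamped by scale_ub.
from typing import Optional, Tuple
import torch

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    y = torch.nn.functional.silu(x0.float()) * x1.float()
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale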
Further examples, all failing identically (a capability-based skip guard is sketched after this list):

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True) -> same CompilationError
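One conventional way to keep property-based FP8 tests green on pre-FP8 hardware is a capability-based skip. A hedged sketch follows; the guard name and its placement are assumptions for illustration, not the mechanism activation_test.py actually uses.

# Hedged sketch of a capability gate for FP8 tests; the HAS_HW_FP8 flag and
# test name below are illustrative, not code from activation_test.py.
import pytest
import torch

HAS_HW_FP8 = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)

@pytest.mark.skipif(
    not HAS_HW_FP8,
    reason="Triton fp8e4nv (e4m3) needs SM 8.9+; this GPU only offers ('fp8e4b15', 'fp8e5')",
)
def test_silu_mul_quant_fp8() -> None:
    ...  # property-based body as in the original test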
Further examples, all failing identically:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> same CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> same CompilationError

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
[source listing and traceback identical to the first example above]
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture.
2025-05-07T20:33:05.2994878Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.2994980Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.2995362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.2995599Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.2995966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.2996067Z kernel = self.compile( 2025-05-07T20:33:05.2996515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.2996747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.2996884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.2996889Z 2025-05-07T20:33:05.2997112Z self = 2025-05-07T20:33:05.2997916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.2998440Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802d85e0>} 2025-05-07T20:33:05.2999216Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.2999496Z context = 2025-05-07T20:33:05.2999501Z 2025-05-07T20:33:05.2999682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.2999957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3000187Z module_map=module_map) 2025-05-07T20:33:05.3000365Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3000468Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3000557Z E ^ 2025-05-07T20:33:05.3000926Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3000931Z 2025-05-07T20:33:05.3001363Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3001371Z 2025-05-07T20:33:05.3001486Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3001722Z self=, 2025-05-07T20:33:05.3001802Z T=2048, 2025-05-07T20:33:05.3001891Z D=5120, 2025-05-07T20:33:05.3001976Z scale_ub=None, 2025-05-07T20:33:05.3002076Z contiguous=False, 2025-05-07T20:33:05.3002161Z compiled=True, 2025-05-07T20:33:05.3002237Z ) 2025-05-07T20:33:05.3002470Z self = 2025-05-07T20:33:05.3002650Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3002655Z 2025-05-07T20:33:05.3002735Z @given( 2025-05-07T20:33:05.3002865Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3002969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3003092Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3003219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3003343Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3003426Z ) 2025-05-07T20:33:05.3003684Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3003783Z def test_silu_mul_quant( 2025-05-07T20:33:05.3003868Z self, 2025-05-07T20:33:05.3003949Z T: int, 2025-05-07T20:33:05.3004029Z D: int, 2025-05-07T20:33:05.3004139Z scale_ub: Optional[float], 2025-05-07T20:33:05.3004234Z contiguous: bool, 2025-05-07T20:33:05.3004322Z compiled: bool, 2025-05-07T20:33:05.3004410Z ) -> None: 2025-05-07T20:33:05.3004508Z torch.manual_seed(2025) 2025-05-07T20:33:05.3004586Z 2025-05-07T20:33:05.3004768Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3004845Z 2025-05-07T20:33:05.3005000Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3005128Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3005223Z x = x_sign * x_clamp 2025-05-07T20:33:05.3005349Z x0 = x[:, :D] 2025-05-07T20:33:05.3005434Z x1 = x[:, D:] 2025-05-07T20:33:05.3005513Z 2025-05-07T20:33:05.3005605Z if contiguous: 2025-05-07T20:33:05.3005698Z x0 = x0.contiguous() 2025-05-07T20:33:05.3005791Z x1 = x1.contiguous() 2025-05-07T20:33:05.3005872Z 2025-05-07T20:33:05.3005965Z if scale_ub is not None: 2025-05-07T20:33:05.3006073Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3006221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3006298Z ) 2025-05-07T20:33:05.3006385Z else: 2025-05-07T20:33:05.3006482Z scale_ub_tensor = None 2025-05-07T20:33:05.3006558Z 2025-05-07T20:33:05.3006700Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3006797Z op = silu_mul_quant 2025-05-07T20:33:05.3006884Z if compiled: 2025-05-07T20:33:05.3007040Z op = torch.compile(op) 2025-05-07T20:33:05.3007153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3007264Z 2025-05-07T20:33:05.3007366Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3007370Z 2025-05-07T20:33:05.3007472Z moe/activation_test.py:117: 2025-05-07T20:33:05.3007604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3007717Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3007821Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3008212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3008310Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3008825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3008938Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3009322Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3009559Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3009925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3010025Z kernel = self.compile( 2025-05-07T20:33:05.3010433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3010617Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3010752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3010756Z 2025-05-07T20:33:05.3010977Z self = 2025-05-07T20:33:05.3011786Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3012322Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802d9440>} 2025-05-07T20:33:05.3013094Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3013301Z context = 2025-05-07T20:33:05.3013306Z 2025-05-07T20:33:05.3013881Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3014190Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3014482Z module_map=module_map) 2025-05-07T20:33:05.3014755Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3014862Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3014949Z E ^ 2025-05-07T20:33:05.3015319Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3015324Z 2025-05-07T20:33:05.3015762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3015767Z 2025-05-07T20:33:05.3015877Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3016109Z self=, 2025-05-07T20:33:05.3016196Z T=2048, 2025-05-07T20:33:05.3016279Z D=5120, 2025-05-07T20:33:05.3016374Z scale_ub=1200.0, 2025-05-07T20:33:05.3016467Z contiguous=False, 2025-05-07T20:33:05.3016551Z compiled=True, 2025-05-07T20:33:05.3016631Z ) 2025-05-07T20:33:05.3016926Z self = 2025-05-07T20:33:05.3017170Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3017175Z 2025-05-07T20:33:05.3017264Z @given( 2025-05-07T20:33:05.3017388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3017497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3017614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3017735Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3017862Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3017939Z ) 2025-05-07T20:33:05.3018193Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3018296Z def test_silu_mul_quant( 2025-05-07T20:33:05.3018376Z self, 2025-05-07T20:33:05.3018457Z T: int, 2025-05-07T20:33:05.3018546Z D: int, 2025-05-07T20:33:05.3018652Z scale_ub: Optional[float], 2025-05-07T20:33:05.3018744Z contiguous: bool, 2025-05-07T20:33:05.3018843Z compiled: bool, 2025-05-07T20:33:05.3018926Z ) -> None: 2025-05-07T20:33:05.3019032Z torch.manual_seed(2025) 2025-05-07T20:33:05.3019108Z 2025-05-07T20:33:05.3019284Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3019366Z 2025-05-07T20:33:05.3019461Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3019590Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3019686Z x = x_sign * x_clamp 2025-05-07T20:33:05.3019770Z x0 = x[:, :D] 2025-05-07T20:33:05.3019851Z x1 = x[:, D:] 2025-05-07T20:33:05.3019930Z 2025-05-07T20:33:05.3020015Z if contiguous: 2025-05-07T20:33:05.3020108Z x0 = x0.contiguous() 2025-05-07T20:33:05.3020206Z x1 = x1.contiguous() 2025-05-07T20:33:05.3020279Z 2025-05-07T20:33:05.3020370Z if scale_ub is not None: 2025-05-07T20:33:05.3020487Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3020628Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3020714Z ) 2025-05-07T20:33:05.3020790Z else: 2025-05-07T20:33:05.3020885Z scale_ub_tensor = None 2025-05-07T20:33:05.3020964Z 2025-05-07T20:33:05.3021097Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3021190Z op = silu_mul_quant 2025-05-07T20:33:05.3021281Z if compiled: 2025-05-07T20:33:05.3021381Z op = torch.compile(op) 2025-05-07T20:33:05.3021487Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3021568Z 2025-05-07T20:33:05.3021658Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3021663Z 2025-05-07T20:33:05.3021767Z moe/activation_test.py:117: 2025-05-07T20:33:05.3021946Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3022049Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3022157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3022575Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3022671Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3023186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3023286Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3023661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3023893Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3024246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3024353Z kernel = self.compile( 2025-05-07T20:33:05.3024744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3025024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3025157Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3025162Z 2025-05-07T20:33:05.3025381Z self = 2025-05-07T20:33:05.3026182Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3026705Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802da660>} 2025-05-07T20:33:05.3027485Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3027685Z context = 2025-05-07T20:33:05.3027690Z 2025-05-07T20:33:05.3027865Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3028136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3028251Z module_map=module_map) 2025-05-07T20:33:05.3028416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3028520Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3028605Z E ^ 2025-05-07T20:33:05.3028969Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3028977Z 2025-05-07T20:33:05.3029402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3029409Z 2025-05-07T20:33:05.3029528Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3029758Z self=, 2025-05-07T20:33:05.3029842Z T=4096, 2025-05-07T20:33:05.3029921Z D=5120, 2025-05-07T20:33:05.3030010Z scale_ub=1200.0, 2025-05-07T20:33:05.3030104Z contiguous=True, 2025-05-07T20:33:05.3030189Z compiled=True, 2025-05-07T20:33:05.3030262Z ) 2025-05-07T20:33:05.3030494Z self = 2025-05-07T20:33:05.3030671Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3030676Z 2025-05-07T20:33:05.3030755Z @given( 2025-05-07T20:33:05.3030883Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3031029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3031150Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3031309Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3031429Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3031509Z ) 2025-05-07T20:33:05.3031763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3031858Z def test_silu_mul_quant( 2025-05-07T20:33:05.3031941Z self, 2025-05-07T20:33:05.3032019Z T: int, 2025-05-07T20:33:05.3032095Z D: int, 2025-05-07T20:33:05.3032200Z scale_ub: Optional[float], 2025-05-07T20:33:05.3032291Z contiguous: bool, 2025-05-07T20:33:05.3032379Z compiled: bool, 2025-05-07T20:33:05.3032463Z ) -> None: 2025-05-07T20:33:05.3032559Z torch.manual_seed(2025) 2025-05-07T20:33:05.3032638Z 2025-05-07T20:33:05.3032814Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3032893Z 2025-05-07T20:33:05.3032993Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3033164Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3033258Z x = x_sign * x_clamp 2025-05-07T20:33:05.3033385Z x0 = x[:, :D] 2025-05-07T20:33:05.3033468Z x1 = x[:, D:] 2025-05-07T20:33:05.3033541Z 2025-05-07T20:33:05.3033632Z if contiguous: 2025-05-07T20:33:05.3033723Z x0 = x0.contiguous() 2025-05-07T20:33:05.3033813Z x1 = x1.contiguous() 2025-05-07T20:33:05.3033891Z 2025-05-07T20:33:05.3033983Z if scale_ub is not None: 2025-05-07T20:33:05.3034095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3034234Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3034310Z ) 2025-05-07T20:33:05.3034394Z else: 2025-05-07T20:33:05.3034492Z scale_ub_tensor = None 2025-05-07T20:33:05.3034566Z 2025-05-07T20:33:05.3034711Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3034806Z op = silu_mul_quant 2025-05-07T20:33:05.3034895Z if compiled: 2025-05-07T20:33:05.3035012Z op = torch.compile(op) 2025-05-07T20:33:05.3035125Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3035198Z 2025-05-07T20:33:05.3035295Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3035300Z 2025-05-07T20:33:05.3035400Z moe/activation_test.py:117: 2025-05-07T20:33:05.3035535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3035639Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3035740Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3036123Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3036219Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3036730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3036841Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3037215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3037453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3037804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3037902Z kernel = self.compile( 2025-05-07T20:33:05.3038305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3038487Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3038617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3038630Z 2025-05-07T20:33:05.3038889Z self = 2025-05-07T20:33:05.3039729Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3040315Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff7802db9c0>} 2025-05-07T20:33:05.3041087Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3041292Z context = 2025-05-07T20:33:05.3041296Z 2025-05-07T20:33:05.3041464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3041738Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3041893Z module_map=module_map) 2025-05-07T20:33:05.3042099Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3042213Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3042292Z E ^ 2025-05-07T20:33:05.3042660Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3042665Z 2025-05-07T20:33:05.3043102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3043106Z 2025-05-07T20:33:05.3043214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3043445Z self=, 2025-05-07T20:33:05.3043536Z T=128, 2025-05-07T20:33:05.3043614Z D=5120, 2025-05-07T20:33:05.3043709Z scale_ub=1200.0, 2025-05-07T20:33:05.3043798Z contiguous=False, 2025-05-07T20:33:05.3043886Z compiled=True, 2025-05-07T20:33:05.3043968Z ) 2025-05-07T20:33:05.3044222Z self = 2025-05-07T20:33:05.3044423Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3044427Z 2025-05-07T20:33:05.3044514Z @given( 2025-05-07T20:33:05.3044639Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3044740Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3044864Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3044983Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3045106Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3045181Z ) 2025-05-07T20:33:05.3045434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3045537Z def test_silu_mul_quant( 2025-05-07T20:33:05.3045615Z self, 2025-05-07T20:33:05.3045692Z T: int, 2025-05-07T20:33:05.3045785Z D: int, 2025-05-07T20:33:05.3045887Z scale_ub: Optional[float], 2025-05-07T20:33:05.3045981Z contiguous: bool, 2025-05-07T20:33:05.3046075Z compiled: bool, 2025-05-07T20:33:05.3046154Z ) -> None: 2025-05-07T20:33:05.3046250Z torch.manual_seed(2025) 2025-05-07T20:33:05.3046337Z 2025-05-07T20:33:05.3046511Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3046594Z 2025-05-07T20:33:05.3046689Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3046817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3046913Z x = x_sign * x_clamp 2025-05-07T20:33:05.3046994Z x0 = x[:, :D] 2025-05-07T20:33:05.3047075Z x1 = x[:, D:] 2025-05-07T20:33:05.3047156Z 2025-05-07T20:33:05.3047242Z if contiguous: 2025-05-07T20:33:05.3047405Z x0 = x0.contiguous() 2025-05-07T20:33:05.3047505Z x1 = x1.contiguous() 2025-05-07T20:33:05.3047580Z 2025-05-07T20:33:05.3047672Z if scale_ub is not None: 2025-05-07T20:33:05.3047821Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3047966Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3048049Z ) 2025-05-07T20:33:05.3048127Z else: 2025-05-07T20:33:05.3048222Z scale_ub_tensor = None 2025-05-07T20:33:05.3048300Z 2025-05-07T20:33:05.3048432Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3048525Z op = silu_mul_quant 2025-05-07T20:33:05.3048616Z if compiled: 2025-05-07T20:33:05.3048716Z op = torch.compile(op) 2025-05-07T20:33:05.3048823Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3048901Z 2025-05-07T20:33:05.3048992Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3048997Z 2025-05-07T20:33:05.3049099Z moe/activation_test.py:117: 2025-05-07T20:33:05.3049234Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3049379Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3049523Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3049901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3049996Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3050511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3050610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3050979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3051217Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3051570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3051677Z kernel = self.compile( 2025-05-07T20:33:05.3052079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3052259Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3052394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3052399Z 2025-05-07T20:33:05.3052609Z self = 2025-05-07T20:33:05.3053417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3053947Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd0fe0>} 2025-05-07T20:33:05.3054775Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3054983Z context = 2025-05-07T20:33:05.3054987Z 2025-05-07T20:33:05.3055157Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3055438Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3055548Z module_map=module_map) 2025-05-07T20:33:05.3055715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3055821Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3055900Z E ^ 2025-05-07T20:33:05.3056271Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3056341Z 2025-05-07T20:33:05.3056811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3056818Z 2025-05-07T20:33:05.3056926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3057164Z self=, 2025-05-07T20:33:05.3057242Z T=16384, 2025-05-07T20:33:05.3057321Z D=7168, 2025-05-07T20:33:05.3057413Z scale_ub=1200.0, 2025-05-07T20:33:05.3057499Z contiguous=True, 2025-05-07T20:33:05.3057591Z compiled=True, 2025-05-07T20:33:05.3057667Z ) 2025-05-07T20:33:05.3057891Z self = 2025-05-07T20:33:05.3058079Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3058084Z 2025-05-07T20:33:05.3058161Z @given( 2025-05-07T20:33:05.3058284Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3058392Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3058551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3058709Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3058835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3058911Z ) 2025-05-07T20:33:05.3059173Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3059267Z def test_silu_mul_quant( 2025-05-07T20:33:05.3059345Z self, 2025-05-07T20:33:05.3059430Z T: int, 2025-05-07T20:33:05.3059507Z D: int, 2025-05-07T20:33:05.3059607Z scale_ub: Optional[float], 2025-05-07T20:33:05.3059706Z contiguous: bool, 2025-05-07T20:33:05.3059794Z compiled: bool, 2025-05-07T20:33:05.3059871Z ) -> None: 2025-05-07T20:33:05.3059975Z torch.manual_seed(2025) 2025-05-07T20:33:05.3060051Z 2025-05-07T20:33:05.3060223Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3060307Z 2025-05-07T20:33:05.3060402Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3060539Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3060631Z x = x_sign * x_clamp 2025-05-07T20:33:05.3060712Z x0 = x[:, :D] 2025-05-07T20:33:05.3060805Z x1 = x[:, D:] 2025-05-07T20:33:05.3060880Z 2025-05-07T20:33:05.3060964Z if contiguous: 2025-05-07T20:33:05.3061061Z x0 = x0.contiguous() 2025-05-07T20:33:05.3061152Z x1 = x1.contiguous() 2025-05-07T20:33:05.3061226Z 2025-05-07T20:33:05.3061325Z if scale_ub is not None: 2025-05-07T20:33:05.3061434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3061571Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3061655Z ) 2025-05-07T20:33:05.3061734Z else: 2025-05-07T20:33:05.3061838Z scale_ub_tensor = None 2025-05-07T20:33:05.3061911Z 2025-05-07T20:33:05.3062042Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3062142Z op = silu_mul_quant 2025-05-07T20:33:05.3062230Z if compiled: 2025-05-07T20:33:05.3062334Z op = torch.compile(op) 2025-05-07T20:33:05.3062446Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3062518Z 2025-05-07T20:33:05.3062610Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3062615Z 2025-05-07T20:33:05.3062719Z moe/activation_test.py:117: 2025-05-07T20:33:05.3062848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3062951Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3063058Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3063436Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3063537Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3064096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3064236Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3064617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3064848Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3065207Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3065304Z kernel = self.compile( 2025-05-07T20:33:05.3065700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3065888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3066018Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3066025Z 2025-05-07T20:33:05.3066236Z self = 2025-05-07T20:33:05.3067130Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3067655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd1e40>} 2025-05-07T20:33:05.3068432Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3068631Z context = 2025-05-07T20:33:05.3068636Z 2025-05-07T20:33:05.3068812Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3069085Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3069204Z module_map=module_map) 2025-05-07T20:33:05.3069374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3069475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3069555Z E ^ 2025-05-07T20:33:05.3069925Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3069930Z 2025-05-07T20:33:05.3070356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3070361Z 2025-05-07T20:33:05.3070472Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3070701Z self=, 2025-05-07T20:33:05.3070783Z T=16384, 2025-05-07T20:33:05.3070869Z D=5120, 2025-05-07T20:33:05.3070953Z scale_ub=1200.0, 2025-05-07T20:33:05.3071043Z contiguous=True, 2025-05-07T20:33:05.3071138Z compiled=False, 2025-05-07T20:33:05.3071215Z ) 2025-05-07T20:33:05.3071447Z self = 2025-05-07T20:33:05.3071630Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3071634Z 2025-05-07T20:33:05.3071712Z @given( 2025-05-07T20:33:05.3071841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3071942Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3072057Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3072182Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3072296Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3072370Z ) 2025-05-07T20:33:05.3072632Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3072774Z def test_silu_mul_quant( 2025-05-07T20:33:05.3072862Z self, 2025-05-07T20:33:05.3072941Z T: int, 2025-05-07T20:33:05.3073056Z D: int, 2025-05-07T20:33:05.3073166Z scale_ub: Optional[float], 2025-05-07T20:33:05.3073259Z contiguous: bool, 2025-05-07T20:33:05.3073347Z compiled: bool, 2025-05-07T20:33:05.3073431Z ) -> None: 2025-05-07T20:33:05.3073528Z torch.manual_seed(2025) 2025-05-07T20:33:05.3073602Z 2025-05-07T20:33:05.3073783Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3073858Z 2025-05-07T20:33:05.3073953Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3074084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3074176Z x = x_sign * x_clamp 2025-05-07T20:33:05.3074267Z x0 = x[:, :D] 2025-05-07T20:33:05.3074348Z x1 = x[:, D:] 2025-05-07T20:33:05.3074424Z 2025-05-07T20:33:05.3074515Z if contiguous: 2025-05-07T20:33:05.3074609Z x0 = x0.contiguous() 2025-05-07T20:33:05.3074766Z x1 = x1.contiguous() 2025-05-07T20:33:05.3074846Z 2025-05-07T20:33:05.3074943Z if scale_ub is not None: 2025-05-07T20:33:05.3075110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3075259Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3075335Z ) 2025-05-07T20:33:05.3075412Z else: 2025-05-07T20:33:05.3075514Z scale_ub_tensor = None 2025-05-07T20:33:05.3075589Z 2025-05-07T20:33:05.3075722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3075821Z op = silu_mul_quant 2025-05-07T20:33:05.3075905Z if compiled: 2025-05-07T20:33:05.3076013Z op = torch.compile(op) 2025-05-07T20:33:05.3076119Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3076193Z 2025-05-07T20:33:05.3076292Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3076297Z 2025-05-07T20:33:05.3076394Z moe/activation_test.py:117: 2025-05-07T20:33:05.3076530Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3076642Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3076743Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3077263Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:05.3077362Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3077730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3077965Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3078316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3078411Z kernel = self.compile( 2025-05-07T20:33:05.3078816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3079003Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3079141Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3079146Z 2025-05-07T20:33:05.3079357Z self = 2025-05-07T20:33:05.3080207Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3080736Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfcd2ca0>} 2025-05-07T20:33:05.3081506Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3081798Z context = 2025-05-07T20:33:05.3081803Z 2025-05-07T20:33:05.3081973Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3082251Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3082359Z module_map=module_map) 2025-05-07T20:33:05.3082523Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3082628Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3082704Z E ^ 2025-05-07T20:33:05.3083071Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3083076Z 2025-05-07T20:33:05.3083514Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3083559Z 2025-05-07T20:33:05.3083668Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3083941Z self=, 2025-05-07T20:33:05.3084020Z T=1, 2025-05-07T20:33:05.3084097Z D=7168, 2025-05-07T20:33:05.3084187Z scale_ub=1200.0, 2025-05-07T20:33:05.3084276Z contiguous=False, 2025-05-07T20:33:05.3084361Z compiled=False, 2025-05-07T20:33:05.3084439Z ) 2025-05-07T20:33:05.3084662Z self = 2025-05-07T20:33:05.3084834Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3084839Z 2025-05-07T20:33:05.3084925Z @given( 2025-05-07T20:33:05.3085045Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3085151Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3085271Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3085390Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3085518Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3085595Z ) 2025-05-07T20:33:05.3085847Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3085951Z def test_silu_mul_quant( 2025-05-07T20:33:05.3086027Z self, 2025-05-07T20:33:05.3086104Z T: int, 2025-05-07T20:33:05.3086186Z D: int, 2025-05-07T20:33:05.3086287Z scale_ub: Optional[float], 2025-05-07T20:33:05.3086385Z contiguous: bool, 2025-05-07T20:33:05.3086472Z compiled: bool, 2025-05-07T20:33:05.3086552Z ) -> None: 2025-05-07T20:33:05.3086654Z torch.manual_seed(2025) 2025-05-07T20:33:05.3086726Z 2025-05-07T20:33:05.3086902Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3086985Z 2025-05-07T20:33:05.3087078Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3091925Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3092055Z x = x_sign * x_clamp 2025-05-07T20:33:05.3092151Z x0 = x[:, :D] 2025-05-07T20:33:05.3092238Z x1 = x[:, D:] 2025-05-07T20:33:05.3092314Z 2025-05-07T20:33:05.3092410Z if contiguous: 2025-05-07T20:33:05.3092505Z x0 = x0.contiguous() 2025-05-07T20:33:05.3092600Z x1 = x1.contiguous() 2025-05-07T20:33:05.3092683Z 2025-05-07T20:33:05.3092778Z if scale_ub is not None: 2025-05-07T20:33:05.3092892Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3093044Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3093123Z ) 2025-05-07T20:33:05.3093204Z else: 2025-05-07T20:33:05.3093311Z scale_ub_tensor = None 2025-05-07T20:33:05.3093387Z 2025-05-07T20:33:05.3093532Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3093706Z op = silu_mul_quant 2025-05-07T20:33:05.3093795Z if compiled: 2025-05-07T20:33:05.3093910Z op = torch.compile(op) 2025-05-07T20:33:05.3094063Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3094139Z 2025-05-07T20:33:05.3094243Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3094248Z 2025-05-07T20:33:05.3094350Z moe/activation_test.py:117: 2025-05-07T20:33:05.3094486Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3094598Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3094703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3095237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3095339Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3095713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3095957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3096401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3096502Z kernel = self.compile( 2025-05-07T20:33:05.3096908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3097091Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3097229Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3097234Z 2025-05-07T20:33:05.3097446Z self = 2025-05-07T20:33:05.3098254Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3098802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0c0e0>} 2025-05-07T20:33:05.3099575Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3099783Z context = 2025-05-07T20:33:05.3099788Z 2025-05-07T20:33:05.3099959Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3100241Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3100353Z module_map=module_map) 2025-05-07T20:33:05.3100521Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3100634Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3100720Z E ^ 2025-05-07T20:33:05.3101093Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3101098Z 2025-05-07T20:33:05.3101533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3101538Z 2025-05-07T20:33:05.3101644Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3101883Z self=, 2025-05-07T20:33:05.3101965Z T=4096, 2025-05-07T20:33:05.3102043Z D=7168, 2025-05-07T20:33:05.3102140Z scale_ub=1200.0, 2025-05-07T20:33:05.3102231Z contiguous=False, 2025-05-07T20:33:05.3102317Z compiled=True, 2025-05-07T20:33:05.3102400Z ) 2025-05-07T20:33:05.3102625Z self = 2025-05-07T20:33:05.3102853Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3102868Z 2025-05-07T20:33:05.3102950Z @given( 2025-05-07T20:33:05.3103116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3103226Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3103345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3103468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3103595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3103674Z ) 2025-05-07T20:33:05.3103929Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3104034Z def test_silu_mul_quant( 2025-05-07T20:33:05.3104112Z self, 2025-05-07T20:33:05.3104191Z T: int, 2025-05-07T20:33:05.3104277Z D: int, 2025-05-07T20:33:05.3104379Z scale_ub: Optional[float], 2025-05-07T20:33:05.3104482Z contiguous: bool, 2025-05-07T20:33:05.3104573Z compiled: bool, 2025-05-07T20:33:05.3104652Z ) -> None: 2025-05-07T20:33:05.3104800Z torch.manual_seed(2025) 2025-05-07T20:33:05.3104876Z 2025-05-07T20:33:05.3105092Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3105177Z 2025-05-07T20:33:05.3105272Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3105401Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3105500Z x = x_sign * x_clamp 2025-05-07T20:33:05.3105582Z x0 = x[:, :D] 2025-05-07T20:33:05.3105664Z x1 = x[:, D:] 2025-05-07T20:33:05.3105747Z 2025-05-07T20:33:05.3105833Z if contiguous: 2025-05-07T20:33:05.3105928Z x0 = x0.contiguous() 2025-05-07T20:33:05.3106027Z x1 = x1.contiguous() 2025-05-07T20:33:05.3106101Z 2025-05-07T20:33:05.3106201Z if scale_ub is not None: 2025-05-07T20:33:05.3106311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3106452Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3106540Z ) 2025-05-07T20:33:05.3106619Z else: 2025-05-07T20:33:05.3106720Z scale_ub_tensor = None 2025-05-07T20:33:05.3106804Z 2025-05-07T20:33:05.3106939Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3107032Z op = silu_mul_quant 2025-05-07T20:33:05.3107128Z if compiled: 2025-05-07T20:33:05.3107231Z op = torch.compile(op) 2025-05-07T20:33:05.3107340Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3107424Z 2025-05-07T20:33:05.3107521Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3107526Z 2025-05-07T20:33:05.3107634Z moe/activation_test.py:117: 2025-05-07T20:33:05.3107770Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3107876Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3107986Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3108369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3108472Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3108998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3109100Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3109480Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3109713Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3110067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3110173Z kernel = self.compile( 2025-05-07T20:33:05.3110569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3110802Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3110984Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3110989Z 2025-05-07T20:33:05.3111203Z self = 2025-05-07T20:33:05.3112023Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3112550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0d300>} 2025-05-07T20:33:05.3113962Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3114285Z context = 2025-05-07T20:33:05.3114451Z 2025-05-07T20:33:05.3114703Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3114991Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3115110Z module_map=module_map) 2025-05-07T20:33:05.3115283Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3115385Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3115465Z E ^ 2025-05-07T20:33:05.3115840Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3115845Z 2025-05-07T20:33:05.3116276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3116283Z 2025-05-07T20:33:05.3116399Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3116640Z self=, 2025-05-07T20:33:05.3116724Z T=128, 2025-05-07T20:33:05.3116815Z D=7168, 2025-05-07T20:33:05.3116903Z scale_ub=1200.0, 2025-05-07T20:33:05.3116991Z contiguous=False, 2025-05-07T20:33:05.3117090Z compiled=True, 2025-05-07T20:33:05.3117167Z ) 2025-05-07T20:33:05.3117392Z self = 2025-05-07T20:33:05.3117577Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:05.3117582Z 2025-05-07T20:33:05.3117661Z @given( 2025-05-07T20:33:05.3117782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3117897Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3118017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3118155Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3118274Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3118353Z ) 2025-05-07T20:33:05.3118618Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3118716Z def test_silu_mul_quant( 2025-05-07T20:33:05.3118800Z self, 2025-05-07T20:33:05.3118885Z T: int, 2025-05-07T20:33:05.3118964Z D: int, 2025-05-07T20:33:05.3119064Z scale_ub: Optional[float], 2025-05-07T20:33:05.3119163Z contiguous: bool, 2025-05-07T20:33:05.3119251Z compiled: bool, 2025-05-07T20:33:05.3119342Z ) -> None: 2025-05-07T20:33:05.3119440Z torch.manual_seed(2025) 2025-05-07T20:33:05.3119516Z 2025-05-07T20:33:05.3119703Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3119779Z 2025-05-07T20:33:05.3119871Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3120006Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3120280Z x = x_sign * x_clamp 2025-05-07T20:33:05.3120364Z x0 = x[:, :D] 2025-05-07T20:33:05.3120454Z x1 = x[:, D:] 2025-05-07T20:33:05.3120531Z 2025-05-07T20:33:05.3120694Z if contiguous: 2025-05-07T20:33:05.3120799Z x0 = x0.contiguous() 2025-05-07T20:33:05.3120889Z x1 = x1.contiguous() 2025-05-07T20:33:05.3120963Z 2025-05-07T20:33:05.3121067Z if scale_ub is not None: 2025-05-07T20:33:05.3121175Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3121322Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3121399Z ) 2025-05-07T20:33:05.3121479Z else: 2025-05-07T20:33:05.3121582Z scale_ub_tensor = None 2025-05-07T20:33:05.3121656Z 2025-05-07T20:33:05.3121788Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3121886Z op = silu_mul_quant 2025-05-07T20:33:05.3121973Z if compiled: 2025-05-07T20:33:05.3122079Z op = torch.compile(op) 2025-05-07T20:33:05.3122194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3122312Z 2025-05-07T20:33:05.3122407Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3122456Z 2025-05-07T20:33:05.3122558Z moe/activation_test.py:117: 2025-05-07T20:33:05.3122688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3122798Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3122900Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3123283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3123386Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3123900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3124001Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3124380Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3124620Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3124982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3125079Z kernel = self.compile( 2025-05-07T20:33:05.3125473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3125662Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3125793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3125797Z 2025-05-07T20:33:05.3126015Z self = 2025-05-07T20:33:05.3126815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3127352Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0e160>} 2025-05-07T20:33:05.3128126Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3128324Z context = 2025-05-07T20:33:05.3128328Z 2025-05-07T20:33:05.3128505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3128779Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3128888Z module_map=module_map) 2025-05-07T20:33:05.3129108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3129212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3129298Z E ^ 2025-05-07T20:33:05.3129704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3129709Z 2025-05-07T20:33:05.3130140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3130145Z 2025-05-07T20:33:05.3130257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3130486Z self=, 2025-05-07T20:33:05.3130572Z T=2048, 2025-05-07T20:33:05.3130653Z D=7168, 2025-05-07T20:33:05.3130739Z scale_ub=None, 2025-05-07T20:33:05.3130834Z contiguous=True, 2025-05-07T20:33:05.3130921Z compiled=True, 2025-05-07T20:33:05.3130997Z ) 2025-05-07T20:33:05.3131229Z self = 2025-05-07T20:33:05.3131449Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3131457Z 2025-05-07T20:33:05.3131535Z @given( 2025-05-07T20:33:05.3131699Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3131803Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3131929Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3132052Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3132167Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3132251Z ) 2025-05-07T20:33:05.3132504Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3132600Z def test_silu_mul_quant( 2025-05-07T20:33:05.3132684Z self, 2025-05-07T20:33:05.3132763Z T: int, 2025-05-07T20:33:05.3132840Z D: int, 2025-05-07T20:33:05.3132951Z scale_ub: Optional[float], 2025-05-07T20:33:05.3133042Z contiguous: bool, 2025-05-07T20:33:05.3133132Z compiled: bool, 2025-05-07T20:33:05.3133218Z ) -> None: 2025-05-07T20:33:05.3133318Z torch.manual_seed(2025) 2025-05-07T20:33:05.3133395Z 2025-05-07T20:33:05.3133576Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3133653Z 2025-05-07T20:33:05.3133753Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3133880Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3133970Z x = x_sign * x_clamp 2025-05-07T20:33:05.3134056Z x0 = x[:, :D] 2025-05-07T20:33:05.3134137Z x1 = x[:, D:] 2025-05-07T20:33:05.3134212Z 2025-05-07T20:33:05.3134304Z if contiguous: 2025-05-07T20:33:05.3134396Z x0 = x0.contiguous() 2025-05-07T20:33:05.3134488Z x1 = x1.contiguous() 2025-05-07T20:33:05.3134567Z 2025-05-07T20:33:05.3134660Z if scale_ub is not None: 2025-05-07T20:33:05.3134770Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3134915Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3134996Z ) 2025-05-07T20:33:05.3135081Z else: 2025-05-07T20:33:05.3135178Z scale_ub_tensor = None 2025-05-07T20:33:05.3135252Z 2025-05-07T20:33:05.3135394Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3135485Z op = silu_mul_quant 2025-05-07T20:33:05.3135570Z if compiled: 2025-05-07T20:33:05.3135678Z op = torch.compile(op) 2025-05-07T20:33:05.3135786Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3135859Z 2025-05-07T20:33:05.3135957Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3135961Z 2025-05-07T20:33:05.3136058Z moe/activation_test.py:117: 2025-05-07T20:33:05.3136193Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3136343Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3136444Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3136870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3136970Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3137481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3137586Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3137954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3138194Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3138545Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3138640Z kernel = self.compile( 2025-05-07T20:33:05.3139039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3139265Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3139436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3139446Z 2025-05-07T20:33:05.3139655Z self = 2025-05-07T20:33:05.3140456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3140984Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfd0f420>} 2025-05-07T20:33:05.3141751Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3141959Z context = 2025-05-07T20:33:05.3141964Z 2025-05-07T20:33:05.3142133Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3142407Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3142521Z module_map=module_map) 2025-05-07T20:33:05.3142688Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3142788Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3142870Z E ^ 2025-05-07T20:33:05.3143236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3143240Z 2025-05-07T20:33:05.3143673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3143680Z 2025-05-07T20:33:05.3143789Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3144020Z self=, 2025-05-07T20:33:05.3144102Z T=16384, 2025-05-07T20:33:05.3144179Z D=5120, 2025-05-07T20:33:05.3144267Z scale_ub=None, 2025-05-07T20:33:05.3144354Z contiguous=False, 2025-05-07T20:33:05.3144441Z compiled=False, 2025-05-07T20:33:05.3144519Z ) 2025-05-07T20:33:05.3144742Z self = 2025-05-07T20:33:05.3144923Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3144928Z 2025-05-07T20:33:05.3145009Z @given( 2025-05-07T20:33:05.3145129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3145228Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3145346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3145537Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3145657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3145731Z ) 2025-05-07T20:33:05.3146026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3146126Z def test_silu_mul_quant( 2025-05-07T20:33:05.3146203Z self, 2025-05-07T20:33:05.3146279Z T: int, 2025-05-07T20:33:05.3146362Z D: int, 2025-05-07T20:33:05.3146459Z scale_ub: Optional[float], 2025-05-07T20:33:05.3146554Z contiguous: bool, 2025-05-07T20:33:05.3146639Z compiled: bool, 2025-05-07T20:33:05.3146717Z ) -> None: 2025-05-07T20:33:05.3146816Z torch.manual_seed(2025) 2025-05-07T20:33:05.3146888Z 2025-05-07T20:33:05.3147061Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3147139Z 2025-05-07T20:33:05.3147232Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3147361Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3149319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3149326Z 2025-05-07T20:33:05.3149448Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3149453Z 2025-05-07T20:33:05.3149560Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3149787Z self=, 2025-05-07T20:33:05.3149868Z T=4096, 2025-05-07T20:33:05.3149945Z D=7168, 2025-05-07T20:33:05.3150027Z scale_ub=1200.0, 2025-05-07T20:33:05.3150118Z contiguous=True, 2025-05-07T20:33:05.3150201Z compiled=True, 2025-05-07T20:33:05.3150279Z ) 2025-05-07T20:33:05.3150506Z self = 2025-05-07T20:33:05.3150681Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3150686Z 2025-05-07T20:33:05.3150762Z @given( 2025-05-07T20:33:05.3150884Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3150982Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3151101Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3151219Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3151332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3151410Z ) 2025-05-07T20:33:05.3151661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3151758Z def test_silu_mul_quant( 2025-05-07T20:33:05.3151841Z self, 2025-05-07T20:33:05.3151917Z T: int, 2025-05-07T20:33:05.3151996Z D: int, 2025-05-07T20:33:05.3152102Z scale_ub: Optional[float], 2025-05-07T20:33:05.3152192Z contiguous: bool, 2025-05-07T20:33:05.3152278Z compiled: bool, 2025-05-07T20:33:05.3152361Z ) -> None: 2025-05-07T20:33:05.3152455Z torch.manual_seed(2025) 2025-05-07T20:33:05.3152526Z 2025-05-07T20:33:05.3152700Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3152775Z 2025-05-07T20:33:05.3152875Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3152998Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3154887Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
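Note on the interleaved OutOfMemoryError failures: free memory shrinks across successive Hypothesis examples (140.44 MiB free above, 28.44 MiB here) because each failed example leaves its bfloat16 inputs in the caching allocator, so later [T, 2*D] allocations (for example T=16384, D=7168 needs a 448 MiB bf16 tensor) no longer fit in the A10G's reported 22.07 GiB. Beyond the allocator hint printed in the message itself (set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True before CUDA initializes), a small cleanup between examples is a common mitigation; a sketch, where the per-example hook placement is an assumption:

    import gc
    import torch

    def release_cuda_memory() -> None:
        # drop dead Python references first, then return the allocator's
        # cached blocks to the driver so the next example starts clean
        gc.collect()
        torch.cuda.empty_cache()

    # hypothetical per-example hook, e.g. unittest tearDown or a pytest fixture:
    # def tearDown(self) -> None:
    #     release_cuda_memory()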
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3154938Z 2025-05-07T20:33:05.3155059Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3155063Z 2025-05-07T20:33:05.3155167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3155399Z self=, 2025-05-07T20:33:05.3155477Z T=16384, 2025-05-07T20:33:05.3155555Z D=7168, 2025-05-07T20:33:05.3155643Z scale_ub=None, 2025-05-07T20:33:05.3155730Z contiguous=False, 2025-05-07T20:33:05.3155815Z compiled=False, 2025-05-07T20:33:05.3155893Z ) 2025-05-07T20:33:05.3156116Z self = 2025-05-07T20:33:05.3156312Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3156356Z 2025-05-07T20:33:05.3156439Z @given( 2025-05-07T20:33:05.3156601Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3156703Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3156817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3156939Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3157052Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3157127Z ) 2025-05-07T20:33:05.3157381Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3157474Z def test_silu_mul_quant( 2025-05-07T20:33:05.3157553Z self, 2025-05-07T20:33:05.3157630Z T: int, 2025-05-07T20:33:05.3157707Z D: int, 2025-05-07T20:33:05.3157807Z scale_ub: Optional[float], 2025-05-07T20:33:05.3157901Z contiguous: bool, 2025-05-07T20:33:05.3157986Z compiled: bool, 2025-05-07T20:33:05.3158071Z ) -> None: 2025-05-07T20:33:05.3158167Z torch.manual_seed(2025) 2025-05-07T20:33:05.3158242Z 2025-05-07T20:33:05.3158421Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3160348Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3160354Z 2025-05-07T20:33:05.3160480Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3160484Z 2025-05-07T20:33:05.3160586Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3160822Z self=, 2025-05-07T20:33:05.3160903Z T=2048, 2025-05-07T20:33:05.3160980Z D=7168, 2025-05-07T20:33:05.3161069Z scale_ub=1200.0, 2025-05-07T20:33:05.3161153Z contiguous=True, 2025-05-07T20:33:05.3161236Z compiled=True, 2025-05-07T20:33:05.3161311Z ) 2025-05-07T20:33:05.3161531Z self = 2025-05-07T20:33:05.3161704Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3161714Z 2025-05-07T20:33:05.3161790Z @given( 2025-05-07T20:33:05.3161908Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3162014Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3162128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3162294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3162415Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3162491Z ) 2025-05-07T20:33:05.3162781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3162880Z def test_silu_mul_quant( 2025-05-07T20:33:05.3162959Z self, 2025-05-07T20:33:05.3163035Z T: int, 2025-05-07T20:33:05.3163115Z D: int, 2025-05-07T20:33:05.3163212Z scale_ub: Optional[float], 2025-05-07T20:33:05.3163306Z contiguous: bool, 2025-05-07T20:33:05.3163391Z compiled: bool, 2025-05-07T20:33:05.3163468Z ) -> None: 2025-05-07T20:33:05.3163567Z torch.manual_seed(2025) 2025-05-07T20:33:05.3163641Z 2025-05-07T20:33:05.3163811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3163889Z 2025-05-07T20:33:05.3163981Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3164111Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3166033Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3166074Z 2025-05-07T20:33:05.3166195Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3166200Z 2025-05-07T20:33:05.3166306Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3166533Z self=, 2025-05-07T20:33:05.3166615Z T=2048, 2025-05-07T20:33:05.3166696Z D=7168, 2025-05-07T20:33:05.3166780Z scale_ub=None, 2025-05-07T20:33:05.3166871Z contiguous=True, 2025-05-07T20:33:05.3166958Z compiled=False, 2025-05-07T20:33:05.3167030Z ) 2025-05-07T20:33:05.3167261Z self = 2025-05-07T20:33:05.3167436Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3167440Z 2025-05-07T20:33:05.3167515Z @given( 2025-05-07T20:33:05.3167643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3167743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3167862Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3167980Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3168093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3168168Z ) 2025-05-07T20:33:05.3168420Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3168516Z def test_silu_mul_quant( 2025-05-07T20:33:05.3168596Z self, 2025-05-07T20:33:05.3168674Z T: int, 2025-05-07T20:33:05.3168750Z D: int, 2025-05-07T20:33:05.3168855Z scale_ub: Optional[float], 2025-05-07T20:33:05.3168947Z contiguous: bool, 2025-05-07T20:33:05.3169033Z compiled: bool, 2025-05-07T20:33:05.3169115Z ) -> None: 2025-05-07T20:33:05.3169210Z torch.manual_seed(2025) 2025-05-07T20:33:05.3169287Z 2025-05-07T20:33:05.3169456Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3169529Z 2025-05-07T20:33:05.3169628Z > x_sign = torch.sign(x) 2025-05-07T20:33:05.3171508Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3171552Z 2025-05-07T20:33:05.3171676Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:05.3171681Z 2025-05-07T20:33:05.3171784Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3172012Z self=, 2025-05-07T20:33:05.3172094Z T=1, 2025-05-07T20:33:05.3172171Z D=7168, 2025-05-07T20:33:05.3172256Z scale_ub=1200.0, 2025-05-07T20:33:05.3172349Z contiguous=True, 2025-05-07T20:33:05.3172434Z compiled=False, 2025-05-07T20:33:05.3172507Z ) 2025-05-07T20:33:05.3172736Z self = 2025-05-07T20:33:05.3172904Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3172911Z 2025-05-07T20:33:05.3172991Z @given( 2025-05-07T20:33:05.3173151Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3173257Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3173438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3173556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3173669Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3173747Z ) 2025-05-07T20:33:05.3173998Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3174094Z def test_silu_mul_quant( 2025-05-07T20:33:05.3174170Z self, 2025-05-07T20:33:05.3174247Z T: int, 2025-05-07T20:33:05.3174325Z D: int, 2025-05-07T20:33:05.3174422Z scale_ub: Optional[float], 2025-05-07T20:33:05.3174512Z contiguous: bool, 2025-05-07T20:33:05.3174602Z compiled: bool, 2025-05-07T20:33:05.3174681Z ) -> None: 2025-05-07T20:33:05.3174775Z torch.manual_seed(2025) 2025-05-07T20:33:05.3174855Z 2025-05-07T20:33:05.3175028Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3175102Z 2025-05-07T20:33:05.3175199Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3175324Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3175413Z x = x_sign * x_clamp 2025-05-07T20:33:05.3175498Z x0 = x[:, :D] 2025-05-07T20:33:05.3175579Z x1 = x[:, D:] 2025-05-07T20:33:05.3175655Z 2025-05-07T20:33:05.3175741Z if contiguous: 2025-05-07T20:33:05.3175834Z x0 = x0.contiguous() 2025-05-07T20:33:05.3175928Z x1 = x1.contiguous() 2025-05-07T20:33:05.3176001Z 2025-05-07T20:33:05.3176090Z if scale_ub is not None: 2025-05-07T20:33:05.3176201Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3176337Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3176417Z ) 2025-05-07T20:33:05.3176496Z else: 2025-05-07T20:33:05.3176594Z scale_ub_tensor = None 2025-05-07T20:33:05.3176666Z 2025-05-07T20:33:05.3176806Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3176896Z op = silu_mul_quant 2025-05-07T20:33:05.3176984Z if compiled: 2025-05-07T20:33:05.3177084Z op = torch.compile(op) 2025-05-07T20:33:05.3177189Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3177267Z 2025-05-07T20:33:05.3177362Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3177366Z 2025-05-07T20:33:05.3177463Z moe/activation_test.py:117: 2025-05-07T20:33:05.3177598Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3177701Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3177801Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3178320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3178470Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3178891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3179123Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3179475Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3179575Z kernel = self.compile( 2025-05-07T20:33:05.3179972Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3180158Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3180289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3180293Z 2025-05-07T20:33:05.3180507Z self = 2025-05-07T20:33:05.3181357Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3181931Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfbc22a0>} 2025-05-07T20:33:05.3182709Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3182907Z context = 2025-05-07T20:33:05.3182911Z 2025-05-07T20:33:05.3183079Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3183359Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3183470Z module_map=module_map) 2025-05-07T20:33:05.3183640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3183737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3183813Z E ^ 2025-05-07T20:33:05.3184180Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3184185Z 2025-05-07T20:33:05.3184609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3184614Z 2025-05-07T20:33:05.3184722Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3184949Z self=, 2025-05-07T20:33:05.3185027Z T=128, 2025-05-07T20:33:05.3185109Z D=5120, 2025-05-07T20:33:05.3185192Z scale_ub=None, 2025-05-07T20:33:05.3185277Z contiguous=True, 2025-05-07T20:33:05.3185363Z compiled=False, 2025-05-07T20:33:05.3185439Z ) 2025-05-07T20:33:05.3185666Z self = 2025-05-07T20:33:05.3185840Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3185845Z 2025-05-07T20:33:05.3185922Z @given( 2025-05-07T20:33:05.3186047Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3186147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3186261Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3186383Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3186497Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3186572Z ) 2025-05-07T20:33:05.3186829Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3186921Z def test_silu_mul_quant( 2025-05-07T20:33:05.3187044Z self, 2025-05-07T20:33:05.3187122Z T: int, 2025-05-07T20:33:05.3187201Z D: int, 2025-05-07T20:33:05.3187299Z scale_ub: Optional[float], 2025-05-07T20:33:05.3187433Z contiguous: bool, 2025-05-07T20:33:05.3187519Z compiled: bool, 2025-05-07T20:33:05.3187604Z ) -> None: 2025-05-07T20:33:05.3187700Z torch.manual_seed(2025) 2025-05-07T20:33:05.3187772Z 2025-05-07T20:33:05.3187947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3188021Z 2025-05-07T20:33:05.3188113Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3188240Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3188327Z x = x_sign * x_clamp 2025-05-07T20:33:05.3188407Z x0 = x[:, :D] 2025-05-07T20:33:05.3188496Z x1 = x[:, D:] 2025-05-07T20:33:05.3188566Z 2025-05-07T20:33:05.3188650Z if contiguous: 2025-05-07T20:33:05.3188749Z x0 = x0.contiguous() 2025-05-07T20:33:05.3188837Z x1 = x1.contiguous() 2025-05-07T20:33:05.3188950Z 2025-05-07T20:33:05.3189045Z if scale_ub is not None: 2025-05-07T20:33:05.3189155Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3189332Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3189410Z ) 2025-05-07T20:33:05.3189488Z else: 2025-05-07T20:33:05.3189586Z scale_ub_tensor = None 2025-05-07T20:33:05.3189657Z 2025-05-07T20:33:05.3189787Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3189881Z op = silu_mul_quant 2025-05-07T20:33:05.3189965Z if compiled: 2025-05-07T20:33:05.3190064Z op = torch.compile(op) 2025-05-07T20:33:05.3190175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3190245Z 2025-05-07T20:33:05.3190336Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3190348Z 2025-05-07T20:33:05.3190447Z moe/activation_test.py:117: 2025-05-07T20:33:05.3190576Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3190684Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3190788Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3191304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3191406Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3191776Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3192009Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3192362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3192456Z kernel = self.compile( 2025-05-07T20:33:05.3192853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3193032Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3193164Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3193169Z 2025-05-07T20:33:05.3193380Z self = 2025-05-07T20:33:05.3194176Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3194704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfbc31a0>} 2025-05-07T20:33:05.3195471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3195750Z context = 2025-05-07T20:33:05.3195792Z 2025-05-07T20:33:05.3195965Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3196238Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3196350Z module_map=module_map) 2025-05-07T20:33:05.3196514Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3196613Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3196696Z E ^ 2025-05-07T20:33:05.3197059Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3197064Z 2025-05-07T20:33:05.3197494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3197501Z 2025-05-07T20:33:05.3197605Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3197881Z self=, 2025-05-07T20:33:05.3198002Z T=128, 2025-05-07T20:33:05.3198078Z D=7168, 2025-05-07T20:33:05.3198161Z scale_ub=None, 2025-05-07T20:33:05.3198250Z contiguous=True, 2025-05-07T20:33:05.3198333Z compiled=False, 2025-05-07T20:33:05.3198410Z ) 2025-05-07T20:33:05.3198631Z self = 2025-05-07T20:33:05.3198803Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3198807Z 2025-05-07T20:33:05.3198886Z @given( 2025-05-07T20:33:05.3199003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3199101Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3199220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3199340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3199451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3199532Z ) 2025-05-07T20:33:05.3199791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3199885Z def test_silu_mul_quant( 2025-05-07T20:33:05.3199961Z self, 2025-05-07T20:33:05.3200035Z T: int, 2025-05-07T20:33:05.3200190Z D: int, 2025-05-07T20:33:05.3200293Z scale_ub: Optional[float], 2025-05-07T20:33:05.3200386Z contiguous: bool, 2025-05-07T20:33:05.3200477Z compiled: bool, 2025-05-07T20:33:05.3200553Z ) -> None: 2025-05-07T20:33:05.3200649Z torch.manual_seed(2025) 2025-05-07T20:33:05.3200723Z 2025-05-07T20:33:05.3200903Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3200977Z 2025-05-07T20:33:05.3201072Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3201200Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3201291Z x = x_sign * x_clamp 2025-05-07T20:33:05.3201374Z x0 = x[:, :D] 2025-05-07T20:33:05.3201456Z x1 = x[:, D:] 2025-05-07T20:33:05.3201530Z 2025-05-07T20:33:05.3201616Z if contiguous: 2025-05-07T20:33:05.3201706Z x0 = x0.contiguous() 2025-05-07T20:33:05.3201799Z x1 = x1.contiguous() 2025-05-07T20:33:05.3201873Z 2025-05-07T20:33:05.3201964Z if scale_ub is not None: 2025-05-07T20:33:05.3202083Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3202221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3202296Z ) 2025-05-07T20:33:05.3202376Z else: 2025-05-07T20:33:05.3202470Z scale_ub_tensor = None 2025-05-07T20:33:05.3202550Z 2025-05-07T20:33:05.3202682Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3202773Z op = silu_mul_quant 2025-05-07T20:33:05.3202909Z if compiled: 2025-05-07T20:33:05.3203016Z op = torch.compile(op) 2025-05-07T20:33:05.3203123Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3203236Z 2025-05-07T20:33:05.3203329Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3203334Z 2025-05-07T20:33:05.3203432Z moe/activation_test.py:117: 2025-05-07T20:33:05.3203561Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3203660Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3203766Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3204333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3204429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3204804Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3205038Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3205390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3205593Z kernel = self.compile( 2025-05-07T20:33:05.3205990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3206173Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3206301Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3206305Z 2025-05-07T20:33:05.3206514Z self = 2025-05-07T20:33:05.3207326Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3207855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfaac040>} 2025-05-07T20:33:05.3208640Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3208839Z context = 2025-05-07T20:33:05.3208843Z 2025-05-07T20:33:05.3209012Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3209290Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3209398Z module_map=module_map) 2025-05-07T20:33:05.3209567Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3209665Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3209744Z E ^ 2025-05-07T20:33:05.3210118Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3210129Z 2025-05-07T20:33:05.3210561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3210566Z 2025-05-07T20:33:05.3210672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3210901Z self=, 2025-05-07T20:33:05.3210978Z T=2048, 2025-05-07T20:33:05.3211056Z D=7168, 2025-05-07T20:33:05.3211140Z scale_ub=1200.0, 2025-05-07T20:33:05.3211226Z contiguous=True, 2025-05-07T20:33:05.3211320Z compiled=False, 2025-05-07T20:33:05.3211395Z ) 2025-05-07T20:33:05.3211616Z self = 2025-05-07T20:33:05.3211797Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3211846Z 2025-05-07T20:33:05.3211923Z @given( 2025-05-07T20:33:05.3212048Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3212187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3212305Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3212426Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3212542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3212615Z ) 2025-05-07T20:33:05.3212873Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3212966Z def test_silu_mul_quant( 2025-05-07T20:33:05.3213040Z self, 2025-05-07T20:33:05.3213121Z T: int, 2025-05-07T20:33:05.3213195Z D: int, 2025-05-07T20:33:05.3213297Z scale_ub: Optional[float], 2025-05-07T20:33:05.3213700Z contiguous: bool, 2025-05-07T20:33:05.3213793Z compiled: bool, 2025-05-07T20:33:05.3213882Z ) -> None: 2025-05-07T20:33:05.3213978Z torch.manual_seed(2025) 2025-05-07T20:33:05.3214140Z 2025-05-07T20:33:05.3214319Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3216224Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3216231Z 2025-05-07T20:33:05.3216354Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3216359Z 2025-05-07T20:33:05.3216461Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3216693Z self=, 2025-05-07T20:33:05.3216775Z T=1, 2025-05-07T20:33:05.3216849Z D=5120, 2025-05-07T20:33:05.3216940Z scale_ub=1200.0, 2025-05-07T20:33:05.3217028Z contiguous=True, 2025-05-07T20:33:05.3217111Z compiled=False, 2025-05-07T20:33:05.3217187Z ) 2025-05-07T20:33:05.3217410Z self = 2025-05-07T20:33:05.3217579Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3217583Z 2025-05-07T20:33:05.3217663Z @given( 2025-05-07T20:33:05.3217781Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3217880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3221320Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3221466Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3221587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3221666Z ) 2025-05-07T20:33:05.3221926Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3222028Z def test_silu_mul_quant( 2025-05-07T20:33:05.3222107Z self, 2025-05-07T20:33:05.3222191Z T: int, 2025-05-07T20:33:05.3222268Z D: int, 2025-05-07T20:33:05.3222366Z scale_ub: Optional[float], 2025-05-07T20:33:05.3222460Z contiguous: bool, 2025-05-07T20:33:05.3222547Z compiled: bool, 2025-05-07T20:33:05.3222631Z ) -> None: 2025-05-07T20:33:05.3222727Z torch.manual_seed(2025) 2025-05-07T20:33:05.3222798Z 2025-05-07T20:33:05.3222975Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3223047Z 2025-05-07T20:33:05.3223139Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3223267Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3223357Z x = x_sign * x_clamp 2025-05-07T20:33:05.3223529Z x0 = x[:, :D] 2025-05-07T20:33:05.3223612Z x1 = x[:, D:] 2025-05-07T20:33:05.3223685Z 2025-05-07T20:33:05.3223768Z if contiguous: 2025-05-07T20:33:05.3223922Z x0 = x0.contiguous() 2025-05-07T20:33:05.3224019Z x1 = x1.contiguous() 2025-05-07T20:33:05.3224095Z 2025-05-07T20:33:05.3224211Z if scale_ub is not None: 2025-05-07T20:33:05.3224333Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3224488Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3224562Z ) 2025-05-07T20:33:05.3224639Z else: 2025-05-07T20:33:05.3224744Z scale_ub_tensor = None 2025-05-07T20:33:05.3224814Z 2025-05-07T20:33:05.3224947Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3225042Z op = silu_mul_quant 2025-05-07T20:33:05.3225128Z if compiled: 2025-05-07T20:33:05.3225227Z op = torch.compile(op) 2025-05-07T20:33:05.3225341Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3225413Z 2025-05-07T20:33:05.3225555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3225561Z 2025-05-07T20:33:05.3225661Z moe/activation_test.py:117: 2025-05-07T20:33:05.3225828Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3225935Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3226035Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3226555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3226655Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3227025Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:05.3227264Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3227615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3227715Z kernel = self.compile( 2025-05-07T20:33:05.3228129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3228312Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3228441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3228448Z 2025-05-07T20:33:05.3228656Z self = 2025-05-07T20:33:05.3229456Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3229983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cfaad580>} 2025-05-07T20:33:05.3230755Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3230960Z context = 2025-05-07T20:33:05.3230965Z 2025-05-07T20:33:05.3231131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3231402Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3231516Z module_map=module_map) 2025-05-07T20:33:05.3231681Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3231779Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3231861Z E ^ 2025-05-07T20:33:05.3232223Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3232272Z 2025-05-07T20:33:05.3232744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3232752Z 2025-05-07T20:33:05.3232859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3233089Z self=, 2025-05-07T20:33:05.3233172Z T=2048, 2025-05-07T20:33:05.3233246Z D=5120, 2025-05-07T20:33:05.3233333Z scale_ub=None, 2025-05-07T20:33:05.3233419Z contiguous=True, 2025-05-07T20:33:05.3233502Z compiled=False, 2025-05-07T20:33:05.3233581Z ) 2025-05-07T20:33:05.3233804Z self = 2025-05-07T20:33:05.3233978Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3233983Z 2025-05-07T20:33:05.3234068Z @given( 2025-05-07T20:33:05.3234192Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3234294Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3234484Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3234662Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3234782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3234857Z ) 2025-05-07T20:33:05.3235111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3235208Z def test_silu_mul_quant( 2025-05-07T20:33:05.3235285Z self, 2025-05-07T20:33:05.3235359Z T: int, 2025-05-07T20:33:05.3235440Z D: int, 2025-05-07T20:33:05.3235538Z scale_ub: Optional[float], 2025-05-07T20:33:05.3235628Z contiguous: bool, 2025-05-07T20:33:05.3235717Z compiled: bool, 2025-05-07T20:33:05.3235795Z ) -> None: 2025-05-07T20:33:05.3235892Z torch.manual_seed(2025) 2025-05-07T20:33:05.3235969Z 2025-05-07T20:33:05.3236144Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3236226Z 2025-05-07T20:33:05.3236319Z > x_sign = torch.sign(x) 2025-05-07T20:33:05.3238169Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3238180Z 2025-05-07T20:33:05.3238301Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:05.3238305Z 2025-05-07T20:33:05.3238412Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3238646Z self=, 2025-05-07T20:33:05.3238723Z T=16384, 2025-05-07T20:33:05.3238805Z D=5120, 2025-05-07T20:33:05.3238890Z scale_ub=None, 2025-05-07T20:33:05.3238981Z contiguous=True, 2025-05-07T20:33:05.3239068Z compiled=False, 2025-05-07T20:33:05.3239144Z ) 2025-05-07T20:33:05.3239366Z self = 2025-05-07T20:33:05.3239550Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3239555Z 2025-05-07T20:33:05.3239630Z @given( 2025-05-07T20:33:05.3239750Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3239853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3239969Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3240179Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3240302Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3240429Z ) 2025-05-07T20:33:05.3240681Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3240782Z def test_silu_mul_quant( 2025-05-07T20:33:05.3240923Z self, 2025-05-07T20:33:05.3241006Z T: int, 2025-05-07T20:33:05.3241081Z D: int, 2025-05-07T20:33:05.3241180Z scale_ub: Optional[float], 2025-05-07T20:33:05.3241276Z contiguous: bool, 2025-05-07T20:33:05.3241360Z compiled: bool, 2025-05-07T20:33:05.3241437Z ) -> None: 2025-05-07T20:33:05.3241534Z torch.manual_seed(2025) 2025-05-07T20:33:05.3241607Z 2025-05-07T20:33:05.3241777Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3243661Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3243705Z 2025-05-07T20:33:05.3243826Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3243831Z 2025-05-07T20:33:05.3243937Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3244163Z self=, 2025-05-07T20:33:05.3244242Z T=4096, 2025-05-07T20:33:05.3244317Z D=5120, 2025-05-07T20:33:05.3244399Z scale_ub=None, 2025-05-07T20:33:05.3244487Z contiguous=True, 2025-05-07T20:33:05.3244570Z compiled=False, 2025-05-07T20:33:05.3244644Z ) 2025-05-07T20:33:05.3244868Z self = 2025-05-07T20:33:05.3245045Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3245049Z 2025-05-07T20:33:05.3245127Z @given( 2025-05-07T20:33:05.3245253Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3245354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3245466Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3245586Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3245701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3245776Z ) 2025-05-07T20:33:05.3246027Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3246122Z def test_silu_mul_quant( 2025-05-07T20:33:05.3246199Z self, 2025-05-07T20:33:05.3246273Z T: int, 2025-05-07T20:33:05.3246351Z D: int, 2025-05-07T20:33:05.3246450Z scale_ub: Optional[float], 2025-05-07T20:33:05.3246537Z contiguous: bool, 2025-05-07T20:33:05.3246623Z compiled: bool, 2025-05-07T20:33:05.3246705Z ) -> None: 2025-05-07T20:33:05.3246799Z torch.manual_seed(2025) 2025-05-07T20:33:05.3246873Z 2025-05-07T20:33:05.3247048Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3248876Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3248885Z 2025-05-07T20:33:05.3249001Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3249005Z 2025-05-07T20:33:05.3249155Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3249384Z self=, 2025-05-07T20:33:05.3249464Z T=2048, 2025-05-07T20:33:05.3249577Z D=5120, 2025-05-07T20:33:05.3249666Z scale_ub=None, 2025-05-07T20:33:05.3249752Z contiguous=False, 2025-05-07T20:33:05.3249834Z compiled=False, 2025-05-07T20:33:05.3249909Z ) 2025-05-07T20:33:05.3250128Z self = 2025-05-07T20:33:05.3250307Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3250311Z 2025-05-07T20:33:05.3250388Z @given( 2025-05-07T20:33:05.3250503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3250606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3250719Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3250835Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3250953Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3251026Z ) 2025-05-07T20:33:05.3251318Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3251454Z def test_silu_mul_quant( 2025-05-07T20:33:05.3251532Z self, 2025-05-07T20:33:05.3251609Z T: int, 2025-05-07T20:33:05.3251683Z D: int, 2025-05-07T20:33:05.3251779Z scale_ub: Optional[float], 2025-05-07T20:33:05.3251869Z contiguous: bool, 2025-05-07T20:33:05.3251954Z compiled: bool, 2025-05-07T20:33:05.3252030Z ) -> None: 2025-05-07T20:33:05.3252127Z torch.manual_seed(2025) 2025-05-07T20:33:05.3252198Z 2025-05-07T20:33:05.3252367Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3254199Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3254211Z 2025-05-07T20:33:05.3254329Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3254333Z 2025-05-07T20:33:05.3254440Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3254665Z self=, 2025-05-07T20:33:05.3254744Z T=4096, 2025-05-07T20:33:05.3254820Z D=7168, 2025-05-07T20:33:05.3254901Z scale_ub=None, 2025-05-07T20:33:05.3254990Z contiguous=True, 2025-05-07T20:33:05.3255073Z compiled=True, 2025-05-07T20:33:05.3255146Z ) 2025-05-07T20:33:05.3255369Z self = 2025-05-07T20:33:05.3255542Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3255550Z 2025-05-07T20:33:05.3255629Z @given( 2025-05-07T20:33:05.3255758Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3255856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3255968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3256087Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3256200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3256276Z ) 2025-05-07T20:33:05.3256524Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3256616Z def test_silu_mul_quant( 2025-05-07T20:33:05.3256692Z self, 2025-05-07T20:33:05.3256766Z T: int, 2025-05-07T20:33:05.3256841Z D: int, 2025-05-07T20:33:05.3256941Z scale_ub: Optional[float], 2025-05-07T20:33:05.3257079Z contiguous: bool, 2025-05-07T20:33:05.3257163Z compiled: bool, 2025-05-07T20:33:05.3257247Z ) -> None: 2025-05-07T20:33:05.3257341Z torch.manual_seed(2025) 2025-05-07T20:33:05.3257453Z 2025-05-07T20:33:05.3257628Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3259452Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3259460Z 2025-05-07T20:33:05.3259579Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3259586Z 2025-05-07T20:33:05.3259688Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3259962Z self=, 2025-05-07T20:33:05.3260078Z T=2048, 2025-05-07T20:33:05.3260155Z D=5120, 2025-05-07T20:33:05.3260242Z scale_ub=1200.0, 2025-05-07T20:33:05.3260327Z contiguous=False, 2025-05-07T20:33:05.3260409Z compiled=False, 2025-05-07T20:33:05.3260489Z ) 2025-05-07T20:33:05.3260710Z self = 2025-05-07T20:33:05.3260891Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3260895Z 2025-05-07T20:33:05.3260972Z @given( 2025-05-07T20:33:05.3261089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3261188Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3261300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3261417Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3261530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3261606Z ) 2025-05-07T20:33:05.3261860Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3261956Z def test_silu_mul_quant( 2025-05-07T20:33:05.3262033Z self, 2025-05-07T20:33:05.3262115Z T: int, 2025-05-07T20:33:05.3262190Z D: int, 2025-05-07T20:33:05.3262288Z scale_ub: Optional[float], 2025-05-07T20:33:05.3262380Z contiguous: bool, 2025-05-07T20:33:05.3262464Z compiled: bool, 2025-05-07T20:33:05.3262540Z ) -> None: 2025-05-07T20:33:05.3262636Z torch.manual_seed(2025) 2025-05-07T20:33:05.3262709Z 2025-05-07T20:33:05.3262883Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3264767Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3264778Z 2025-05-07T20:33:05.3264898Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3264902Z 2025-05-07T20:33:05.3265006Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3265232Z self=, 2025-05-07T20:33:05.3265311Z T=4096, 2025-05-07T20:33:05.3265386Z D=7168, 2025-05-07T20:33:05.3265468Z scale_ub=1200.0, 2025-05-07T20:33:05.3265553Z contiguous=True, 2025-05-07T20:33:05.3265635Z compiled=False, 2025-05-07T20:33:05.3265752Z ) 2025-05-07T20:33:05.3265975Z self = 2025-05-07T20:33:05.3266189Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3266195Z 2025-05-07T20:33:05.3266273Z @given( 2025-05-07T20:33:05.3266394Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3266491Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3266604Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3266723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3266835Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3266908Z ) 2025-05-07T20:33:05.3267156Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3267249Z def test_silu_mul_quant( 2025-05-07T20:33:05.3267328Z self, 2025-05-07T20:33:05.3267402Z T: int, 2025-05-07T20:33:05.3267478Z D: int, 2025-05-07T20:33:05.3267578Z scale_ub: Optional[float], 2025-05-07T20:33:05.3267666Z contiguous: bool, 2025-05-07T20:33:05.3267793Z compiled: bool, 2025-05-07T20:33:05.3267877Z ) -> None: 2025-05-07T20:33:05.3268009Z torch.manual_seed(2025) 2025-05-07T20:33:05.3268080Z 2025-05-07T20:33:05.3268252Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3270076Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3270090Z 2025-05-07T20:33:05.3270205Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3270214Z 2025-05-07T20:33:05.3270315Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3270548Z self=, 2025-05-07T20:33:05.3270624Z T=16384, 2025-05-07T20:33:05.3270702Z D=7168, 2025-05-07T20:33:05.3270786Z scale_ub=None, 2025-05-07T20:33:05.3270872Z contiguous=False, 2025-05-07T20:33:05.3270952Z compiled=True, 2025-05-07T20:33:05.3271026Z ) 2025-05-07T20:33:05.3271245Z self = 2025-05-07T20:33:05.3271424Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:05.3271429Z 2025-05-07T20:33:05.3271503Z @given( 2025-05-07T20:33:05.3271620Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3271722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3271837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3271952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3272072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3272146Z ) 2025-05-07T20:33:05.3272399Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3272493Z def test_silu_mul_quant( 2025-05-07T20:33:05.3272569Z self, 2025-05-07T20:33:05.3272646Z T: int, 2025-05-07T20:33:05.3272720Z D: int, 2025-05-07T20:33:05.3272817Z scale_ub: Optional[float], 2025-05-07T20:33:05.3272909Z contiguous: bool, 2025-05-07T20:33:05.3272997Z compiled: bool, 2025-05-07T20:33:05.3273074Z ) -> None: 2025-05-07T20:33:05.3273171Z torch.manual_seed(2025) 2025-05-07T20:33:05.3273241Z 2025-05-07T20:33:05.3273410Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3275317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3275360Z 2025-05-07T20:33:05.3275481Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3275486Z 2025-05-07T20:33:05.3275587Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3275812Z self=, 2025-05-07T20:33:05.3275891Z T=4096, 2025-05-07T20:33:05.3275966Z D=7168, 2025-05-07T20:33:05.3276047Z scale_ub=None, 2025-05-07T20:33:05.3276137Z contiguous=True, 2025-05-07T20:33:05.3276220Z compiled=False, 2025-05-07T20:33:05.3276292Z ) 2025-05-07T20:33:05.3276558Z self = 2025-05-07T20:33:05.3276769Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3276774Z 2025-05-07T20:33:05.3276853Z @given( 2025-05-07T20:33:05.3276971Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3277067Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3277181Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3277298Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3277410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3277484Z ) 2025-05-07T20:33:05.3277734Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3277828Z def test_silu_mul_quant( 2025-05-07T20:33:05.3277905Z self, 2025-05-07T20:33:05.3277982Z T: int, 2025-05-07T20:33:05.3278058Z D: int, 2025-05-07T20:33:05.3278157Z scale_ub: Optional[float], 2025-05-07T20:33:05.3278244Z contiguous: bool, 2025-05-07T20:33:05.3278334Z compiled: bool, 2025-05-07T20:33:05.3278413Z ) -> None: 2025-05-07T20:33:05.3278506Z torch.manual_seed(2025) 2025-05-07T20:33:05.3278579Z 2025-05-07T20:33:05.3278748Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3280666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
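The PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint printed with every failure targets fragmentation. In these messages only about 19 MiB is "reserved by PyTorch but unallocated", so fragmentation is a minor factor here, but the knob is cheap to try. It is read when the CUDA caching allocator initializes, so it must be in the environment before the process first touches the GPU; a sketch, assuming a fresh Python process:

import os

# Must be set before the first CUDA allocation in this process;
# exporting it afterwards has no effect on the already-built allocator.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after the variable is set

x = torch.randn(1024, device="cuda")  # allocator now uses expandable segments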
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3280684Z 2025-05-07T20:33:05.3280802Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3280809Z 2025-05-07T20:33:05.3280912Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3281142Z self=, 2025-05-07T20:33:05.3281218Z T=16384, 2025-05-07T20:33:05.3281299Z D=7168, 2025-05-07T20:33:05.3281381Z scale_ub=None, 2025-05-07T20:33:05.3281465Z contiguous=True, 2025-05-07T20:33:05.3281549Z compiled=False, 2025-05-07T20:33:05.3281621Z ) 2025-05-07T20:33:05.3281840Z self = 2025-05-07T20:33:05.3282022Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:05.3282026Z 2025-05-07T20:33:05.3282101Z @given( 2025-05-07T20:33:05.3282216Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3282367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3282478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3282636Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3282751Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3282824Z ) 2025-05-07T20:33:05.3283081Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3283173Z def test_silu_mul_quant( 2025-05-07T20:33:05.3283248Z self, 2025-05-07T20:33:05.3283327Z T: int, 2025-05-07T20:33:05.3283401Z D: int, 2025-05-07T20:33:05.3283498Z scale_ub: Optional[float], 2025-05-07T20:33:05.3283592Z contiguous: bool, 2025-05-07T20:33:05.3283678Z compiled: bool, 2025-05-07T20:33:05.3283757Z ) -> None: 2025-05-07T20:33:05.3283849Z torch.manual_seed(2025) 2025-05-07T20:33:05.3283924Z 2025-05-07T20:33:05.3284091Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3285960Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3286006Z 2025-05-07T20:33:05.3286125Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3286129Z 2025-05-07T20:33:05.3286231Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3286458Z self=, 2025-05-07T20:33:05.3286534Z T=16384, 2025-05-07T20:33:05.3286616Z D=7168, 2025-05-07T20:33:05.3286702Z scale_ub=1200.0, 2025-05-07T20:33:05.3286784Z contiguous=True, 2025-05-07T20:33:05.3286870Z compiled=False, 2025-05-07T20:33:05.3286943Z ) 2025-05-07T20:33:05.3287166Z self = 2025-05-07T20:33:05.3287346Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3287350Z 2025-05-07T20:33:05.3287426Z @given( 2025-05-07T20:33:05.3287542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3287642Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3287754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3287868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3287983Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3288055Z ) 2025-05-07T20:33:05.3288303Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3288401Z def test_silu_mul_quant( 2025-05-07T20:33:05.3288475Z self, 2025-05-07T20:33:05.3288556Z T: int, 2025-05-07T20:33:05.3288629Z D: int, 2025-05-07T20:33:05.3288727Z scale_ub: Optional[float], 2025-05-07T20:33:05.3288818Z contiguous: bool, 2025-05-07T20:33:05.3288901Z compiled: bool, 2025-05-07T20:33:05.3288976Z ) -> None: 2025-05-07T20:33:05.3289073Z torch.manual_seed(2025) 2025-05-07T20:33:05.3289144Z 2025-05-07T20:33:05.3289314Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3291145Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
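The "Tried to allocate" sizes are exactly what the test's first line implies: x has shape [T, 2*D] in bfloat16, i.e. T * 2D elements at 2 bytes each. A quick check against the figures in this log (the T=2048 case appears a little further down):

def randn_bytes(T: int, D: int, bytes_per_elem: int = 2) -> int:
    # Size of torch.randn([T, 2 * D], dtype=torch.bfloat16) in bytes.
    return T * (2 * D) * bytes_per_elem

assert randn_bytes(4096, 7168) == 112 * 1024**2   # 112.00 MiB
assert randn_bytes(16384, 7168) == 448 * 1024**2  # 448.00 MiB
assert randn_bytes(2048, 7168) == 56 * 1024**2    # 56.00 MiB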
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3291200Z 2025-05-07T20:33:05.3291355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3291360Z 2025-05-07T20:33:05.3291465Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3291690Z self=, 2025-05-07T20:33:05.3291769Z T=128, 2025-05-07T20:33:05.3291845Z D=5120, 2025-05-07T20:33:05.3291926Z scale_ub=1200.0, 2025-05-07T20:33:05.3292012Z contiguous=False, 2025-05-07T20:33:05.3292095Z compiled=False, 2025-05-07T20:33:05.3292166Z ) 2025-05-07T20:33:05.3292387Z self = 2025-05-07T20:33:05.3292562Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:05.3292566Z 2025-05-07T20:33:05.3292643Z @given( 2025-05-07T20:33:05.3292765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3292862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3293023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3293174Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3293287Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3293362Z ) 2025-05-07T20:33:05.3293611Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3293702Z def test_silu_mul_quant( 2025-05-07T20:33:05.3293781Z self, 2025-05-07T20:33:05.3293856Z T: int, 2025-05-07T20:33:05.3293931Z D: int, 2025-05-07T20:33:05.3294031Z scale_ub: Optional[float], 2025-05-07T20:33:05.3294144Z contiguous: bool, 2025-05-07T20:33:05.3294233Z compiled: bool, 2025-05-07T20:33:05.3294330Z ) -> None: 2025-05-07T20:33:05.3294426Z torch.manual_seed(2025) 2025-05-07T20:33:05.3294504Z 2025-05-07T20:33:05.3294673Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3294748Z 2025-05-07T20:33:05.3294844Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3294975Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3295063Z x = x_sign * x_clamp 2025-05-07T20:33:05.3295144Z x0 = x[:, :D] 2025-05-07T20:33:05.3295223Z x1 = x[:, D:] 2025-05-07T20:33:05.3295294Z 2025-05-07T20:33:05.3295380Z if contiguous: 2025-05-07T20:33:05.3295469Z x0 = x0.contiguous() 2025-05-07T20:33:05.3295558Z x1 = x1.contiguous() 2025-05-07T20:33:05.3295631Z 2025-05-07T20:33:05.3295719Z if scale_ub is not None: 2025-05-07T20:33:05.3295824Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3295962Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3296036Z ) 2025-05-07T20:33:05.3296115Z else: 2025-05-07T20:33:05.3296211Z scale_ub_tensor = None 2025-05-07T20:33:05.3296280Z 2025-05-07T20:33:05.3296416Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3296507Z op = silu_mul_quant 2025-05-07T20:33:05.3296592Z if compiled: 2025-05-07T20:33:05.3296694Z op = torch.compile(op) 2025-05-07T20:33:05.3296798Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3296868Z 2025-05-07T20:33:05.3296961Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3296965Z 2025-05-07T20:33:05.3297063Z moe/activation_test.py:117: 2025-05-07T20:33:05.3297195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3297294Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3297392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3297907Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3298050Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3298417Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:05.3298693Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3299046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3299144Z kernel = self.compile( 2025-05-07T20:33:05.3299534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3299711Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3299840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3299845Z 2025-05-07T20:33:05.3300054Z self = 2025-05-07T20:33:05.3300857Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3301477Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf7e11c0>} 2025-05-07T20:33:05.3302249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3302447Z context = 2025-05-07T20:33:05.3302451Z 2025-05-07T20:33:05.3302617Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3302891Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3303000Z module_map=module_map) 2025-05-07T20:33:05.3303166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3303270Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3303347Z E ^ 2025-05-07T20:33:05.3303708Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3303717Z 2025-05-07T20:33:05.3304141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3304146Z 2025-05-07T20:33:05.3304247Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3304477Z self=, 2025-05-07T20:33:05.3304554Z T=2048, 2025-05-07T20:33:05.3304628Z D=7168, 2025-05-07T20:33:05.3304714Z scale_ub=None, 2025-05-07T20:33:05.3304799Z contiguous=False, 2025-05-07T20:33:05.3304884Z compiled=False, 2025-05-07T20:33:05.3304961Z ) 2025-05-07T20:33:05.3305181Z self = 2025-05-07T20:33:05.3305369Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:05.3305373Z 2025-05-07T20:33:05.3305450Z @given( 2025-05-07T20:33:05.3305569Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3305668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3305781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3305896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3306012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3306084Z ) 2025-05-07T20:33:05.3306337Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3306429Z def test_silu_mul_quant( 2025-05-07T20:33:05.3306503Z self, 2025-05-07T20:33:05.3306627Z T: int, 2025-05-07T20:33:05.3306702Z D: int, 2025-05-07T20:33:05.3306798Z scale_ub: Optional[float], 2025-05-07T20:33:05.3306891Z contiguous: bool, 2025-05-07T20:33:05.3307012Z compiled: bool, 2025-05-07T20:33:05.3307090Z ) -> None: 2025-05-07T20:33:05.3307191Z torch.manual_seed(2025) 2025-05-07T20:33:05.3307261Z 2025-05-07T20:33:05.3307431Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3309266Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
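Interleaved with the OOMs is a different failure entirely: the triton.compiler.errors.CompilationError above. The traceback shows silu_mul_quant launching the Triton kernel _fbgemm_silu_mul_quant, which uses the fp8e4nv dtype (PyTorch's float8_e4m3fn). Triton only emits that conversion on NVIDIA parts with compute capability 8.9 or newer, while the A10G typically behind this g5-class runner is SM 8.6 and only offers the fp8e4b15/fp8e5 variants named in the error. A sketch of the usual guard for such tests; the 8.9 threshold is an assumption based on Triton's fp8e4nv support matrix, and requires_fp8 is a hypothetical marker name:

import pytest

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) conversions need SM 8.9+ (Ada/Hopper);
    # the SM 8.6 A10G in this log fails Triton compilation instead.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

requires_fp8 = pytest.mark.skipif(
    not supports_fp8e4nv(), reason="GPU lacks fp8e4nv support"
)

With such a marker on test_silu_mul_quant, this runner would report a skip rather than a hard failure.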
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3309274Z 2025-05-07T20:33:05.3309392Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3309437Z 2025-05-07T20:33:05.3309544Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3309808Z self=, 2025-05-07T20:33:05.3309888Z T=128, 2025-05-07T20:33:05.3309964Z D=7168, 2025-05-07T20:33:05.3310048Z scale_ub=1200.0, 2025-05-07T20:33:05.3310135Z contiguous=True, 2025-05-07T20:33:05.3310218Z compiled=True, 2025-05-07T20:33:05.3310288Z ) 2025-05-07T20:33:05.3310512Z self = 2025-05-07T20:33:05.3310681Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3310686Z 2025-05-07T20:33:05.3310760Z @given( 2025-05-07T20:33:05.3310881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3310979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3311098Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3311215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3311329Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3311408Z ) 2025-05-07T20:33:05.3311659Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3311751Z def test_silu_mul_quant( 2025-05-07T20:33:05.3311828Z self, 2025-05-07T20:33:05.3311903Z T: int, 2025-05-07T20:33:05.3311978Z D: int, 2025-05-07T20:33:05.3312079Z scale_ub: Optional[float], 2025-05-07T20:33:05.3312167Z contiguous: bool, 2025-05-07T20:33:05.3312251Z compiled: bool, 2025-05-07T20:33:05.3312329Z ) -> None: 2025-05-07T20:33:05.3312421Z torch.manual_seed(2025) 2025-05-07T20:33:05.3312497Z 2025-05-07T20:33:05.3312664Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3312739Z 2025-05-07T20:33:05.3312833Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3312957Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3313051Z x = x_sign * x_clamp 2025-05-07T20:33:05.3313137Z x0 = x[:, :D] 2025-05-07T20:33:05.3313218Z x1 = x[:, D:] 2025-05-07T20:33:05.3313289Z 2025-05-07T20:33:05.3313900Z if contiguous: 2025-05-07T20:33:05.3314012Z x0 = x0.contiguous() 2025-05-07T20:33:05.3314100Z x1 = x1.contiguous() 2025-05-07T20:33:05.3314173Z 2025-05-07T20:33:05.3314264Z if scale_ub is not None: 2025-05-07T20:33:05.3314375Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:05.3314516Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:05.3314590Z ) 2025-05-07T20:33:05.3314668Z else: 2025-05-07T20:33:05.3314763Z scale_ub_tensor = None 2025-05-07T20:33:05.3314833Z 2025-05-07T20:33:05.3314969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:05.3315194Z op = silu_mul_quant 2025-05-07T20:33:05.3315283Z if compiled: 2025-05-07T20:33:05.3315445Z op = torch.compile(op) 2025-05-07T20:33:05.3315554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3315626Z 2025-05-07T20:33:05.3315724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:05.3315730Z 2025-05-07T20:33:05.3315827Z moe/activation_test.py:117: 2025-05-07T20:33:05.3315960Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3316060Z moe/activation_test.py:115: in fn 2025-05-07T20:33:05.3316160Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:05.3316544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:05.3316637Z return fn(*args, **kwargs) 
2025-05-07T20:33:05.3317142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:05.3317247Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:05.3317733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda> 2025-05-07T20:33:05.3317966Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:05.3318314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:05.3318407Z kernel = self.compile( 2025-05-07T20:33:05.3318802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:05.3318979Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:05.3319105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:05.3319113Z 2025-05-07T20:33:05.3319320Z self = 2025-05-07T20:33:05.3320201Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:05.3320732Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7ff5cf9f7b00>} 2025-05-07T20:33:05.3321500Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:05.3321698Z context = 2025-05-07T20:33:05.3321702Z 2025-05-07T20:33:05.3321870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:05.3322143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:05.3322258Z module_map=module_map) 2025-05-07T20:33:05.3322423Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:05.3322526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:05.3322601Z E ^ 2025-05-07T20:33:05.3322963Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:05.3322968Z 2025-05-07T20:33:05.3323394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:05.3323399Z 2025-05-07T20:33:05.3323501Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3323727Z self=, 2025-05-07T20:33:05.3323805Z T=128, 2025-05-07T20:33:05.3323879Z D=7168, 2025-05-07T20:33:05.3323964Z scale_ub=1200.0, 2025-05-07T20:33:05.3324122Z contiguous=True, 2025-05-07T20:33:05.3324204Z compiled=False, 2025-05-07T20:33:05.3324281Z ) 2025-05-07T20:33:05.3324540Z self = 2025-05-07T20:33:05.3324718Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:05.3324723Z 2025-05-07T20:33:05.3324801Z @given( 2025-05-07T20:33:05.3324919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3325020Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3325138Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3325252Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3325367Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3325439Z ) 2025-05-07T20:33:05.3325690Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3325787Z def test_silu_mul_quant( 2025-05-07T20:33:05.3325867Z self, 2025-05-07T20:33:05.3325940Z T: int, 2025-05-07T20:33:05.3326018Z D: int, 2025-05-07T20:33:05.3326166Z scale_ub: Optional[float], 2025-05-07T20:33:05.3326256Z contiguous: bool, 2025-05-07T20:33:05.3326385Z compiled: bool, 2025-05-07T20:33:05.3326463Z ) -> None: 2025-05-07T20:33:05.3326556Z torch.manual_seed(2025) 2025-05-07T20:33:05.3326630Z 2025-05-07T20:33:05.3326799Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3326875Z 2025-05-07T20:33:05.3326964Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3327087Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3328932Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3328944Z 2025-05-07T20:33:05.3329062Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3329066Z 2025-05-07T20:33:05.3329171Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3329397Z self=, 2025-05-07T20:33:05.3329473Z T=128, 2025-05-07T20:33:05.3329550Z D=5120, 2025-05-07T20:33:05.3329632Z scale_ub=1200.0, 2025-05-07T20:33:05.3329715Z contiguous=True, 2025-05-07T20:33:05.3329800Z compiled=True, 2025-05-07T20:33:05.3329872Z ) 2025-05-07T20:33:05.3330096Z self = 2025-05-07T20:33:05.3330265Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:05.3330272Z 2025-05-07T20:33:05.3330346Z @given( 2025-05-07T20:33:05.3330470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3330572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3330686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3330806Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3330917Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3330988Z ) 2025-05-07T20:33:05.3331238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3331330Z def test_silu_mul_quant( 2025-05-07T20:33:05.3331406Z self, 2025-05-07T20:33:05.3331482Z T: int, 2025-05-07T20:33:05.3331556Z D: int, 2025-05-07T20:33:05.3331659Z scale_ub: Optional[float], 2025-05-07T20:33:05.3331747Z contiguous: bool, 2025-05-07T20:33:05.3331830Z compiled: bool, 2025-05-07T20:33:05.3331957Z ) -> None: 2025-05-07T20:33:05.3332051Z torch.manual_seed(2025) 2025-05-07T20:33:05.3332126Z 2025-05-07T20:33:05.3332338Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3332415Z 2025-05-07T20:33:05.3332507Z x_sign = torch.sign(x) 2025-05-07T20:33:05.3332633Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:05.3334451Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
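These last two examples fail one line later, at torch.clamp: with 22.06 GiB already in use, even a 20 MiB temporary cannot be carved out. Note also that the setup briefly multiplies its own footprint: x, x_sign, the torch.abs(x) intermediate, x_clamp, and the x_sign * x_clamp product all materialize as full [T, 2*D] buffers, several of them alive at once. A sketch of the same sign-preserving clamp done in place, assuming the original random values are not needed afterwards (in this test they are not):

import torch

def sign_preserving_clamp_(x: torch.Tensor, lo: float = 0.01, hi: float = 2.0) -> torch.Tensor:
    # In-place equivalent of: torch.sign(x) * torch.clamp(torch.abs(x), lo, hi).
    # Peak usage is one extra buffer (the sign), not three or four.
    sign = torch.sign(x)
    x.abs_().clamp_(lo, hi).mul_(sign)
    return x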
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3334459Z 2025-05-07T20:33:05.3334581Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:05.3334624Z 2025-05-07T20:33:05.3334727Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:05.3334993Z self=, 2025-05-07T20:33:05.3335074Z T=128, 2025-05-07T20:33:05.3335149Z D=7168, 2025-05-07T20:33:05.3335232Z scale_ub=None, 2025-05-07T20:33:05.3335318Z contiguous=True, 2025-05-07T20:33:05.3335400Z compiled=True, 2025-05-07T20:33:05.3335476Z ) 2025-05-07T20:33:05.3335696Z self = 2025-05-07T20:33:05.3335864Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:05.3335871Z 2025-05-07T20:33:05.3335946Z @given( 2025-05-07T20:33:05.3336064Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:05.3336167Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:05.3336282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:05.3336398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:05.3336515Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:05.3336591Z ) 2025-05-07T20:33:05.3336844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:05.3336940Z def test_silu_mul_quant( 2025-05-07T20:33:05.3337014Z self, 2025-05-07T20:33:05.3337089Z T: int, 2025-05-07T20:33:05.3337166Z D: int, 2025-05-07T20:33:05.3337263Z scale_ub: Optional[float], 2025-05-07T20:33:05.3337355Z contiguous: bool, 2025-05-07T20:33:05.3337439Z compiled: bool, 2025-05-07T20:33:05.3337515Z ) -> None: 2025-05-07T20:33:05.3337610Z torch.manual_seed(2025) 2025-05-07T20:33:05.3337680Z 2025-05-07T20:33:05.3337849Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:05.3339680Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:05.3339691Z 2025-05-07T20:33:05.3339808Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:05.3339946Z =============================== warnings summary =============================== 2025-05-07T20:33:05.3340260Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3340570Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3340925Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:05.3341870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:05.3342108Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:05.3342112Z 2025-05-07T20:33:05.3342326Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:05.3342497Z ================= 1 failed, 1 deselected, 3 warnings in 12.08s ================= 2025-05-07T20:33:06.9302879Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:06.9930745Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:06.9930998Z 2025-05-07T20:33:06.9931479Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:06.9932171Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:06.9932609Z 2025-05-07T20:33:06.9932613Z 2025-05-07T20:33:06.9932617Z 2025-05-07T20:33:06.9948006Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:07.0029463Z Post job cleanup. 2025-05-07T20:33:07.1011702Z [command]/usr/bin/git version 2025-05-07T20:33:07.1055358Z git version 2.47.1 2025-05-07T20:33:07.1093527Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/558b33c6-0bf6-4269-b53e-73cbc3faf9f6/.gitconfig' 2025-05-07T20:33:07.1103994Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/558b33c6-0bf6-4269-b53e-73cbc3faf9f6' before making global git config changes 2025-05-07T20:33:07.1104911Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:07.1122136Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:07.1168299Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:07.1203249Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:07.1537911Z Entering 'external/asmjit' 2025-05-07T20:33:07.1605774Z Entering 'external/composable_kernel' 2025-05-07T20:33:07.1678712Z Entering 'external/cpuinfo' 2025-05-07T20:33:07.1746011Z Entering 'external/cutlass' 2025-05-07T20:33:07.1823081Z Entering 'external/googletest' 2025-05-07T20:33:07.1889815Z Entering 'external/hipify_torch' 2025-05-07T20:33:07.1957067Z Entering 'external/json' 2025-05-07T20:33:07.2042541Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:07.2067767Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2082374Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:07.2115151Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:07.2446447Z Entering 'external/asmjit' 2025-05-07T20:33:07.2489429Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2532463Z Entering 'external/composable_kernel' 2025-05-07T20:33:07.2574780Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2624909Z Entering 'external/cpuinfo' 2025-05-07T20:33:07.2667303Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2710138Z Entering 'external/cutlass' 2025-05-07T20:33:07.2754777Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2805465Z 
Entering 'external/googletest' 2025-05-07T20:33:07.2848725Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2890798Z Entering 'external/hipify_torch' 2025-05-07T20:33:07.2934574Z http.https://github.com/.extraheader 2025-05-07T20:33:07.2978558Z Entering 'external/json' 2025-05-07T20:33:07.3021625Z http.https://github.com/.extraheader 2025-05-07T20:33:07.3177212Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:07.3211313Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:07.3222811Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:07.3223191Z ##[endgroup] 2025-05-07T20:33:07.3322756Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:18.1192327Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:33:35.4262467Z Cleaning up orphan processes
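Taken together, the log shows two distinct problems on this runner: a GPU without fp8e4nv support failing Triton compilation, and cascading CUDA OOMs across Hypothesis examples; the swap in/out alerts from after_job.sh are consistent with heavy host memory pressure during the same window. The failing invocation is recorded verbatim above and can be replayed locally for debugging:

conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py

Because of --lf --last-failed-no-failures none, this reruns only the tests pytest's cache remembers as failing, so a first full run of moe/activation_test.py may be needed to populate .pytest_cache.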